
Unit-6

Association Rule Mining


Introduction

 Many business enterprises accumulate large quantities of data from their day-to-day operations.
 For example, grocery stores / retail stores
 Market basket transactions:

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Introduction
 Such data can be used to learn about the purchasing behavior of customers.
 Useful for marketing promotions, inventory management, and customer relationship management.
 Association analysis is useful for discovering interesting relationships hidden in large data sets.
 Relationships are represented as association rules or sets of frequent items, e.g.
{Diapers} → {Beer}
 A rule like this captures the purchase of one product when another product is purchased.
Market Basket Analysis

 One basket tells you what one customer purchased at one time.
 A loyalty card makes it possible to tie together purchases by a single customer (or household) over time.
Market Basket Analysis
 Retail – each customer purchases a different set of products, in different quantities, at different times
 Retailers use this information to:
 Identify who customers are (not by name)
 Understand why they make certain purchases
 Gain insight about their merchandise (products)
• Fast and slow movers
• Products which are purchased together
• Products which might benefit from promotion
 Take action:
• Store layouts
• Which products to put on specials, promote, coupons…
 Combining all of this with a customer loyalty card makes it even more valuable
Market Basket Analysis
 Association rules can be applied on other types of “baskets.”
 Items purchased on a credit card, such as rental cars and hotel rooms, provide insight into the next product that customers are likely to purchase.
 Optional services purchased by telecommunications customers (call
waiting, call forwarding, DSL, speed call, and so on) help determine
how to bundle these services together to maximize revenue.
 Banking products used by retail customers (money market accounts,
certificate of deposit, investment services, car loans, and so on)
identify customers likely to want other products.
 Unusual combinations of insurance claims can be a sign of fraud and
can spark further investigation.
 Medical patient histories can give indications of likely complications
based on certain combinations of treatments.
What is Association Rule Mining

 Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Example of Association Rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
How can Association rules be used?
What is Association Rule Mining

 Rule form:
Antecedent → Consequent [support, confidence]
(support and confidence are user-defined measures of interestingness)

 Let the rule discovered be {Bread, ...} → {Potato Chips}


 Potato chips as consequent => Can be used to determine what
should be done to boost its sales
 Bread in the antecedent => Can be used to see which
products would be affected if the store discontinues selling
bread
 Bread in antecedent and Potato chips in the consequent =>
Can be used to see what products should be sold with Bread
to promote sale of Potato Chips
Association Rule Notation
Basic concepts

 Given:
 (1) database of transactions,
 (2) each transaction is a list of items purchased by a
customer in a visit

 Find:
 all rules that correlate the presence of one set of items
(itemset) with that of another set of items
 E.g., 35% of people who buy salmon also buy cheese
The model: data

 I = {i1, i2, …, im}: a set of items
 Transaction t: a set of items, with t ⊆ I
 Transaction Database T: a set of transactions T = {t1, t2, …, tn}


Transaction data: Supermarket data
 Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
 Concepts:
 An item: an item/article in a basket
 I: the set of all items sold in the store
 A transaction: items purchased in a basket; it may have
TID (transaction ID)
 A transactional dataset: A set of transactions
Definitions
 Itemset
 A collection of one or more items
• Example: {Milk, Bread, Diaper}
 k-itemset
• An itemset that contains k items
 Support count (σ)
 Frequency of occurrence of an itemset
 E.g. σ({Milk, Bread, Diaper}) = 2
 Frequent Itemset
 An itemset whose support is greater than or equal to a minsup threshold

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Definition: Association Rule
 Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

 Rule Evaluation Metrics
– Support (s)
 Fraction of transactions that contain both X and Y
– Confidence (c)
 Fraction of transactions containing X that also contain Y
 c = sup(X ∪ Y) / sup(X)

TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}
s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67
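As a quick sanity check of these two metrics, here is a minimal Python sketch (not part of the original slides) that computes s and c for {Milk, Diaper} → {Beer} on the five transactions above:

```python
# Illustrative sketch: support and confidence for {Milk, Diaper} -> {Beer}
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

X, Y = {"Milk", "Diaper"}, {"Beer"}
s = support_count(X | Y, transactions) / len(transactions)              # 2/5 = 0.4
c = support_count(X | Y, transactions) / support_count(X, transactions) # 2/3 ≈ 0.67
print(f"support = {s:.2f}, confidence = {c:.2f}")
```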
Example

Rule  Support     Confidence
a.    3/5 = 0.6   3/4 = 0.75
b.    3/5 = 0.6   3/3 = 1
c.    1/5 = 0.2   1/2 = 0.5
d.    1/5 = 0.2   1/3 = 0.33
e.    1/5 = 0.2   1/1 = 1
f.    0           0
(The rules a–f themselves appear only in the original slide figure.)
Why Support and Confidence
 Support
 is an important measure because a rule that has very low support may
occur simply by chance.
 A low support rule is also likely to be uninteresting from a business
perspective because it may not be profitable to promote items that
customers seldom buy together.
 For these reasons, support is often used to eliminate uninteresting
rules.
 Confidence
 measures the reliability of the inference made by a rule.
 For a given rule X → Y, the higher the confidence, the more likely it is for Y to be present in transactions that contain X.
 Confidence also provides an estimate of the conditional probability of
Y given X.
Association Rule Mining Problem

 Given a set of transactions T, the goal of association rule mining is to find all rules having
 support ≥ minsup threshold
 confidence ≥ minconf threshold

where minsup and minconf are the corresponding support and confidence
thresholds.
 Brute-force approach:
 List all possible association rules
 Compute the support and confidence for each rule
 Prune rules that fail the minsup and minconf thresholds

 Computationally prohibitive!
Computational Complexity
 Given d unique items:
 Total number of itemsets = 2^d
 Total number of possible association rules:

$R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1$

If d = 6, R = 602 rules
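A small script (illustrative only) can confirm the count for d = 6 by summing the expression directly and comparing it with the closed form:

```python
from math import comb

def total_rules(d):
    """R = sum over k of C(d, k) * sum over j of C(d - k, j)."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

print(total_rules(6))    # 602
print(3**6 - 2**7 + 1)   # 602, the closed form 3^d - 2^(d+1) + 1 agrees
```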


Mining Association Rules
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke

Example of Rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
Mining Association Rules

Two-step approach:
 Frequent Itemset Generation
• Generate all itemsets whose support ≥ minsup.
• These itemsets are called frequent itemsets.
 Rule Generation
• Generate high confidence rules from each frequent
itemset.
• These rules are called strong rules.

Frequent itemset generation is still computationally expensive.


Frequent Itemset Generation
Given d items, there are 2^d possible candidate itemsets.
(Figure: the itemset lattice for d = 5, from the null set through the 1-itemsets A–E up to ABCDE.)
Frequent Itemset Generation
 Brute-force approach:
 Each itemset in the lattice is a candidate frequent itemset
 Count the support of each candidate by scanning the database of N transactions against the list of M candidates
 Match each transaction (of width w) against every candidate
 If the candidate is contained in a transaction, its support count is incremented
 Complexity ~ O(NMw) => expensive, since M = 2^d !!!
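The cost structure is easy to see in a direct implementation. The sketch below (illustrative only, using the example transactions) enumerates every non-empty candidate itemset and counts support with the nested N × M loop:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
items = sorted(set().union(*transactions))

# Enumerate all 2^d - 1 non-empty candidate itemsets (M grows exponentially in d).
candidates = [frozenset(c) for k in range(1, len(items) + 1)
              for c in combinations(items, k)]

# N transactions x M candidates: increment a count whenever a candidate
# is contained in a transaction (subset test costs roughly O(w)).
counts = {c: 0 for c in candidates}
for t in transactions:          # N
    for c in candidates:        # M
        if c <= t:
            counts[c] += 1

frequent = {c for c, n in counts.items() if n >= 2}  # minsup count = 2
print(len(candidates), len(frequent))
```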
Frequent Itemset Generation Strategies

 Reduce the number of candidates (M)
 Complete search: M = 2^d
 Use pruning techniques to reduce M

 Reduce the number of transactions (N)
 Reduce the size of N as the size of the itemset increases
 Used by DHP and vertical-based mining algorithms

 Reduce the number of comparisons (NM)
 Use efficient data structures to store the candidates or transactions
 No need to match every candidate against every transaction
Reducing Number of Candidates
 Apriori algorithm:
 for finding frequent itemsets in a dataset
 Name of the algorithm is Apriori because it uses prior
knowledge of frequent itemset properties.
 We apply an iterative, level-wise search in which frequent k-itemsets are used to find frequent (k+1)-itemsets.
Reducing Number of Candidates
 Apriori principle:
 If an itemset is frequent, then all of its subsets must also be
frequent
If an itemset is infrequent, all its supersets will be infrequent.

 A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
 If {beer, diaper, nuts} is frequent, then {beer, diaper} must also be frequent
Reducing Number of Candidates
 Apriori principle:
 If an itemset is frequent, then all of its subsets must also be
frequent
If an itemset is infrequent, all its supersets will be infrequent.
 Apriori principle holds due to the following property of
the support measure:

$\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y)$
 Support of an itemset never exceeds the support of its subsets
 This is known as the anti-monotone property of support
Illustrating Apriori Principle

If an itemset is found to be infrequent, then all of its supersets must also be infrequent, so they can be pruned from the lattice.
(Figure: the itemset lattice with an infrequent itemset marked "Found to be Infrequent" and all of its supersets marked as pruned.)
Example

Consider the following dataset (shown in the original slide figure); we will find the frequent itemsets and generate association rules for them.

 Minimum support count is 2
 Minimum confidence is 60%
Example

 Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset – called C1 (candidate set)

 (II) Compare each candidate's support count with the minimum support count; the items that qualify form the frequent itemset L1.
Example
 Step-2: K=2
 Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common.
 Check whether all subsets of each itemset are frequent; if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
 Now find the support count of these itemsets by searching the dataset.
Example
 (II) Compare each candidate's (C2) support count with the minimum support count; this gives us itemset L2.
Example
 Step-3:
 Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common, so here, for L2, the first element should match.
So the itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, {I2, I3, I5}.
 Check whether all subsets of these itemsets are frequent; if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, {I1, I3}, which are frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Check every itemset similarly.)
 Find the support count of the remaining itemsets by searching the dataset.
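The join-and-prune step can be sketched in Python as below. This is illustrative code; the itemsets are kept as sorted tuples, and the L2 shown is an assumption inferred from the joined itemsets listed above (I1…I5 are the example items):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """Join L_{k-1} with itself (first k-2 items must match), then prune any
    candidate that has an infrequent (k-1)-subset."""
    prev = set(L_prev)                     # frequent (k-1)-itemsets as sorted tuples
    candidates = set()
    for a in L_prev:
        for b in L_prev:
            # join condition: same first k-2 items, last item of a precedes last item of b
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                cand = a + (b[k - 2],)
                # prune: every (k-1)-subset of the candidate must be frequent
                if all(sub in prev for sub in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

# Assumed L2 for the slide example (inferred from the joined 3-itemsets above):
L2 = [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
      ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]
print(apriori_gen(L2, 3))   # with this L2, only {I1,I2,I3} and {I1,I2,I5} survive the prune
```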
Example
 (II) Compare candidate (C3) support count with minimum support count
 this gives us itemset L3.

 Step-4:
 Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K=4) is that they should have (K-2) elements in common, so here, for L3, the first 2 items should match.
 Check whether all subsets of these itemsets are frequent. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}; its subsets include {I1, I3, I5}, which is not frequent.) So there is no itemset in C4.
 We stop here because no frequent itemsets are found further
Example
 We have discovered all the frequent item-sets.
 Now generation of strong association rule comes into picture.
 For that we need to calculate confidence of each rule.
 Confidence –
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
 Confidence(A->B)=Support_count(A∪B)/Support_count(A)
Example
 Itemset {I1, I2, I3} //from L3
So the rules can be:
[I1^I2]=>[I3] //confidence = sup(I1^I2^I3)/sup(I1^I2) =
2/4*100=50%
[I1^I3]=>[I2] //confidence = sup(I1^I2^I3)/sup(I1^I3) =
2/4*100=50%
[I2^I3]=>[I1] //confidence = sup(I1^I2^I3)/sup(I2^I3) =
2/4*100=50%
[I1]=>[I2^I3] //confidence = sup(I1^I2^I3)/sup(I1) = 2/6*100=33%
[I2]=>[I1^I3] //confidence = sup(I1^I2^I3)/sup(I2) = 2/7*100=28%
[I3]=>[I1^I2] //confidence = sup(I1^I2^I3)/sup(I3) = 2/6*100=33%
 So if minimum confidence is 50%, then first 3 rules can be considered
as strong association rules.
Illustrating Apriori Principle

TID Items
1 Bread, Milk
2 Beer, Bread, Diaper, Eggs
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Bread, Coke, Diaper, Milk

Items (1-itemsets):
Item Count
Bread 4
Coke 2
Milk 4
Beer 3
Diaper 4
Eggs 1

Minimum Support = 2

If every subset is considered:
6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
With support-based pruning:
6 + 6 + 4 = 16
Illustrating Apriori Principle

Pairs (2-itemsets), generated from the frequent 1-itemsets
(no need to generate candidates involving Coke or Eggs):
Itemset Count
{Bread, Milk} 3
{Bread, Beer} 2
{Bread, Diaper} 3
{Milk, Beer} 2
{Milk, Diaper} 3
{Beer, Diaper} 3
Illustrating Apriori Principle

Triplets (3-itemsets), generated from the frequent 2-itemsets:
Itemset Count
{Beer, Diaper, Milk} 2
{Beer, Bread, Diaper} 2
{Bread, Diaper, Milk} 2
{Beer, Bread, Milk} 1

Minimum Support = 2

If every subset is considered:
6C1 + 6C2 + 6C3 = 6 + 15 + 20 = 41
With support-based pruning:
6 + 6 + 4 = 16
Apriori Algorithm
 Method:
 Let k=1
 Generate frequent itemsets of length 1
 Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k
frequent itemsets
• Prune candidate itemsets containing subsets of length k
that are infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only
those that are frequent
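Putting the method together, a minimal Python sketch of this level-wise loop might look as follows. It is an illustrative implementation, not the original slide code: candidate generation here simply unions frequent k-itemsets and then prunes, and `min_count` is the minsup threshold expressed as a count.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise Apriori: frequent k-itemsets generate (k+1)-candidates,
    candidates with an infrequent k-subset are pruned, then counted."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # k = 1: count single items and keep the frequent ones
    current = {c: n for c, n in count([frozenset([i]) for i in items]).items()
               if n >= min_count}
    frequent = dict(current)
    k = 1
    while current:
        # generate (k+1)-candidates by unioning frequent k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # prune candidates that contain an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        current = {c: n for c, n in count(candidates).items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent

# Usage on the market-basket transactions with minimum support count 2
T = [{"Bread", "Milk"},
     {"Bread", "Diaper", "Beer", "Eggs"},
     {"Milk", "Diaper", "Beer", "Coke"},
     {"Bread", "Milk", "Diaper", "Beer"},
     {"Bread", "Milk", "Diaper", "Coke"}]
for itemset, n in sorted(apriori(T, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), n)
```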
Generating AR from frequent itemsets

 Confidence
 For every frequent itemset x, generate all non-empty proper subsets of x
 For every non-empty proper subset s of x, output the rule s → (x − s) if sup(x) / sup(s) ≥ minconf
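A sketch of this rule-generation step, assuming `frequent` is a dict mapping frequent itemsets (frozensets) to their support counts, such as the output of the Apriori sketch above (the function name is illustrative):

```python
from itertools import combinations

def generate_rules(frequent, min_conf):
    """For every frequent itemset x and every non-empty proper subset s of x,
    emit s -> (x - s) when sup(x) / sup(s) >= min_conf."""
    rules = []
    for x, sup_x in frequent.items():
        if len(x) < 2:
            continue
        for r in range(1, len(x)):
            for s in map(frozenset, combinations(x, r)):
                # sup(s) is available because every subset of a frequent
                # itemset is itself frequent (anti-monotone property)
                conf = sup_x / frequent[s]
                if conf >= min_conf:
                    rules.append((set(s), set(x - s), conf))
    return rules

# e.g. with the frequent itemsets found above and min_conf = 0.6:
# for lhs, rhs, conf in generate_rules(apriori(T, 2), 0.6):
#     print(lhs, "->", rhs, round(conf, 2))
```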


The Apriori Algorithm — Example
The Apriori Algorithm — Example
(Contd.)
 Frequent Item set = {2,3,5}
 Rules are:
Association Rule   Confidence    Confidence %
2^3 → 5            2/2 = 1       100%
2^5 → 3            2/3 ≈ 0.67    67%
3^5 → 2            2/2 = 1       100%
5 → 2^3            2/3 ≈ 0.67    67%
3 → 2^5            2/3 ≈ 0.67    67%
2 → 3^5            2/3 ≈ 0.67    67%

 If the minimum confidence threshold is 70%, then the


only strong rules are: 2^3 →5 & 3^5→2
The Apriori Algorithm—An Example
Supmin = 2

Database TDB:
Tid Items
10 A, C, D
20 B, C, E
30 A, B, C, E
40 B, E

1st scan → C1:
Itemset sup
{A} 2
{B} 3
{C} 3
{D} 1
{E} 3

L1:
Itemset sup
{A} 2
{B} 3
{C} 3
{E} 3

C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
Itemset sup
{A, B} 1
{A, C} 2
{A, E} 1
{B, C} 2
{B, E} 3
{C, E} 2

L2:
Itemset sup
{A, C} 2
{B, C} 2
{B, E} 3
{C, E} 2

C3 (generated from L2): {B, C, E}

3rd scan → L3:
Itemset sup
{B, C, E} 2
Is Apriori Fast Enough? —
Performance Bottlenecks
 The core of the Apriori algorithm:
 Use frequent (k – 1)-itemsets to generate candidate frequent
k-itemsets
 Use database scan and pattern matching to collect counts for
the candidate itemsets
 The bottleneck of Apriori: Candidate generation
 Huge candidate sets
 Multiple scans of database
Problems with the association mining

 Rare Item Problem: It assumes that all items in the data are
of the same nature and/or have similar frequencies.

 Not true: in many applications, some items appear very frequently in the data, while others rarely appear.
 E.g., in a supermarket, people buy a food processor and a cooking pan much less frequently than they buy bread and milk.
Interestingness Measurements

 How good is the association rule?


 Are all of the strong association rules discovered
interesting enough to present to the user?
 How can we measure the interestingness of a rule?
 Subjective measures
 A rule (pattern) is interesting if
• it is unexpected (surprising to the user); and/or
• actionable (the user can do something with it)
• (only the user can judge the interestingness of a rule)
Apriori Advantages &
Disadvantages
 Advantages:
 Uses large itemset property.
 Easily parallelized
 Easy to implement.
 Disadvantages:
 Assumes transaction database is memory resident.
 Requires up to m database scans.
Thank You
