6 - Association Rules - For Students
Association Rules
Outline
Overview
Apriori Algorithm
Examples
Diagnostic
Overview
• Association rules are widely used for market basket analysis.
• Some questions that association rules can answer:
• Which products tend to be purchased together?
The general logic behind association rules: (diagram on the original slide, not reproduced here)
Rules
• Each rule is of the form X → Y.
• It means that when item X is observed, item Y is also observed.
Itemset
• A collection of items or individual entities that share some kind of relationship.
• An itemset containing k items is called a k-itemset: {item 1, item 2, …, item k}.
• Examples:
• A set of retail items purchased together in one transaction.
• A set of hyperlinks clicked on by one user in a single session.
Apriori Algorithm
• The most fundamental algorithm for generating association rules.
Frequent Itemset
• Items that appear together “often enough” (i.e., meet the minimum support criterion).
• If the minimum support is set at 0.7, {bread} is considered a frequent itemset, whereas {bread, butter} is not.
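The support computation behind this criterion can be sketched in a few lines. The slide does not show the underlying transactions, so the five baskets below are hypothetical, chosen so that {bread} clears a 0.7 minimum support and {bread, butter} does not:

```python
# Hypothetical baskets (the slide's underlying transaction data is not shown).
transactions = [
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread"},
    {"bread", "butter", "milk"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

min_support = 0.7
print(support({"bread"}, transactions))            # 0.8 -> frequent
print(support({"bread", "butter"}, transactions))  # 0.4 -> not frequent
```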
Apriori Property
• Also called the downward closure property.
Apriori Algorithm
Creating Frequent Sets
• Apriori employs an iterative approach known as a level-wise search, in which frequent k-itemsets are used to explore (k+1)-itemsets.
• First, the set of frequent 1-itemsets, L1, is found. Then L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.
Example 1
(5 transactions; 6 types of items; MinSupp = 3/5 = 0.6)

L1 – Items (1-itemsets)
  Item    Count
  Bread   4
  Coke    2
  Milk    4
  Beer    3
  Diaper  4
  Eggs    1

L2 – Pairs (2-itemsets)
(No need to generate candidates involving Coke or Eggs.)
  Itemset          Count
  {Bread, Milk}    3
  {Bread, Beer}    2
  {Bread, Diaper}  3
  {Milk, Beer}     2
  {Milk, Diaper}   3
  {Beer, Diaper}   3

L3 – Triplets (3-itemsets)
  Itemset                Count
  {Bread, Milk, Diaper}  2
  {Bread, Milk, Beer}    1
  {Bread, Diaper, Beer}  2
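The counts above can be reproduced with a short script. The slide lists only the counts, not the baskets themselves, so the five baskets below are an illustrative reconstruction consistent with those counts:

```python
from collections import Counter
from itertools import combinations

# Five baskets consistent with the counts in the example above (illustrative).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

min_count = 3  # MinSupp = 3/5 = 0.6

# 1-itemset counts -> L1
c1 = Counter(item for t in transactions for item in t)
l1 = {item for item, n in c1.items() if n >= min_count}
print(sorted(l1))  # ['Beer', 'Bread', 'Diaper', 'Milk']

# 2-itemset counts, built only from frequent items (Coke and Eggs pruned)
c2 = Counter()
for t in transactions:
    for pair in combinations(sorted(t & l1), 2):
        c2[pair] += 1
l2 = {pair for pair, n in c2.items() if n >= min_count}
print(sorted(l2))
```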
• Let’s define:
Ck as the set of candidate itemsets of size k
Lk as the set of frequent itemsets of size k
Illustrating the Apriori Principle
• Any subset of a frequent itemset must also be frequent.
Pseudo Code
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
  Ck+1 = candidates generated from Lk;
  for each transaction t in database do
    increment the count of all candidates in Ck+1 that are contained in t;
  Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
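A minimal, self-contained sketch of this pseudocode in Python, assuming transactions are given as a list of sets (the Apriori prune uses the downward closure property from the previous slide):

```python
from collections import Counter
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: use frequent k-itemsets to build (k+1)-candidates."""
    n = len(transactions)
    # L1 = {frequent items}
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    lk = {s for s, c in counts.items() if c / n >= min_support}
    frequent = {s: counts[s] / n for s in lk}
    k = 1
    while lk:
        # Generate C(k+1) from Lk, pruning candidates with an infrequent
        # k-subset (the Apriori / downward closure property).
        candidates = {a | b for a in lk for b in lk if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(sub) in lk for sub in combinations(c, k))}
        counts = Counter()
        for t in transactions:  # one pass over the database per level
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        lk = {c for c in candidates if counts[c] / n >= min_support}
        frequent.update({c: counts[c] / n for c in lk})
        k += 1
    return frequent  # the union of all Lk, with supports

baskets = [{"Bread", "Milk"},
           {"Bread", "Diaper", "Beer", "Eggs"},
           {"Milk", "Diaper", "Beer", "Coke"},
           {"Bread", "Milk", "Diaper", "Beer"},
           {"Bread", "Milk", "Diaper", "Coke"}]
result = apriori(baskets, min_support=0.6)
print(sorted(result, key=len))  # 4 frequent items, then 4 frequent pairs
```

At MinSupp = 0.6 no 3-itemset survives, so the search stops after level 2, matching Example 1.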
Example 2
Evaluation of Candidate Rules
The process of creating association rules is two-staged.
• First, a set of candidate rules based on frequent itemsets is generated.
• If {Bread, Egg, Milk, Butter} is the frequent itemset, candidate rules will look like:
• {Egg, Milk, Butter} → {Bread}
• {Bread, Milk, Butter} → {Egg}
• {Bread, Egg} → {Milk, Butter}
• Etc.
• Second, each candidate rule is evaluated with measures such as confidence:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)

• E.g., if {bread, eggs, milk} has a support of 0.15 and {bread, eggs} also has a support of 0.15, the confidence of the rule {bread, eggs} → {milk} is 1.
• This means that 100% of the time a customer buys bread and eggs, milk is bought as well. The rule is therefore correct for 100% of the transactions containing bread and eggs.
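The confidence computation from this example, assuming supports are stored in a plain dictionary keyed by frozenset:

```python
def confidence(x, y, support):
    """Confidence(X -> Y) = support(X ∪ Y) / support(X)."""
    return support[frozenset(x | y)] / support[frozenset(x)]

# Supports taken from the example above
support = {
    frozenset({"bread", "eggs"}): 0.15,
    frozenset({"bread", "eggs", "milk"}): 0.15,
}
print(confidence({"bread", "eggs"}, {"milk"}, support))  # 1.0
```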
Confidence
• {Toothbrush} → {Milk}: Confidence = 10/(10+4) ≈ 0.71
https://towardsdatascience.com/association-rules-2-aa9a77241654
Lift
• Measures how many times more often X and Y occur together than would be expected if they were statistically independent of each other.
• A measure of how X and Y are really related, rather than coincidentally happening together.

Lift(X ⇒ Y) = support(X ∪ Y) / (support(X) ∗ support(Y))

• A lift greater than 1 indicates a positive association; e.g., it can be concluded that milk and bread have a stronger association than milk and eggs when Lift(milk ⇒ bread) exceeds Lift(milk ⇒ eggs).
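A sketch of the lift comparison. The milk/bread supports match the leverage example later in this section (0.5, 0.4, and 0.4); the egg supports are hypothetical, chosen only so that the milk/eggs pair is roughly independent:

```python
def lift(x, y, support):
    """Lift(X => Y) = support(X ∪ Y) / (support(X) * support(Y))."""
    return support[frozenset(x | y)] / (support[frozenset(x)] * support[frozenset(y)])

support = {
    frozenset({"milk"}): 0.5,
    frozenset({"bread"}): 0.4,
    frozenset({"milk", "bread"}): 0.4,
    frozenset({"eggs"}): 0.1,          # hypothetical
    frozenset({"milk", "eggs"}): 0.05,  # hypothetical
}
print(lift({"milk"}, {"bread"}, support))  # 2.0 -> positive association
print(lift({"milk"}, {"eggs"}, support))   # 1.0 -> roughly independent
```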
Leverage
• Measures the difference between the probability of X and Y appearing together in the dataset and what would be expected if X and Y were statistically independent of each other.

Leverage(X → Y) = support(X ∪ Y) − support(X) ∗ support(Y)
• If, out of 1,000 transactions, {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Leverage(milk → bread) = 0.4 − 0.5 ∗ 0.4 = 0.2.
• Milk and bread therefore have a stronger association than milk and eggs.
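The leverage arithmetic from this example, assuming the counts are out of 1,000 transactions (inferred from 400/1,000 = 0.4 in the calculation above):

```python
def leverage(x, y, support):
    """Leverage(X -> Y) = support(X ∪ Y) - support(X) * support(Y)."""
    return support[frozenset(x | y)] - support[frozenset(x)] * support[frozenset(y)]

# 400, 500, and 400 occurrences out of an assumed 1,000 transactions
support = {
    frozenset({"milk"}): 0.5,
    frozenset({"bread"}): 0.4,
    frozenset({"milk", "bread"}): 0.4,
}
print(leverage({"milk"}, {"bread"}, support))  # 0.2
```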
• Confidence is able to identify trustworthy rules, but it cannot tell whether a rule is coincidental.
• Measures such as lift and leverage not only ensure that interesting rules are identified but also filter out the coincidental rules.
Example 3
Example 4
• Consider a database, D, consisting of 9 transactions (only five of the rows are reproduced below):

  TID   List of Items
  T100  I1, I2, I5
  T105  I2, I3
  T106  I1, I3
  T107  I1, I2, I3, I5
  T108  I1, I2, I3

• First, find the frequent itemsets using the Apriori algorithm.
• Then, generate association rules using minimum support and minimum confidence.
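The second stage, rule generation with a minimum confidence, can be sketched as follows, run on just the five transactions shown above (the full nine-transaction database is not reproduced on the slide, so the numbers are illustrative):

```python
from itertools import combinations

# The five transactions shown above (the full database has nine rows).
transactions = [
    {"I1", "I2", "I5"},
    {"I2", "I3"},
    {"I1", "I3"},
    {"I1", "I2", "I3", "I5"},
    {"I1", "I2", "I3"},
]

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def rules_from(itemset, min_conf):
    """All rules X -> Y with X ∪ Y = `itemset` and confidence >= min_conf."""
    rules = []
    for r in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), r):
            x = frozenset(lhs)
            conf = support(itemset) / support(x)
            if conf >= min_conf:
                rules.append((set(x), set(itemset - x), conf))
    return rules

for x, y, conf in rules_from(frozenset({"I1", "I2", "I5"}), min_conf=0.8):
    print(sorted(x), "->", sorted(y), conf)
```

On these five transactions, only the rules with I5 on the left-hand side reach 80% confidence.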
Applications of Association Rules
The term market basket analysis refers to a specific implementation of
association rules.
• For better merchandising – products to include/exclude from
inventory each month
• Placement of products
• Cross-selling
• Promotional programs—multiple product purchase incentives
managed through a loyalty card program
• Input: simple point-of-sale transaction data
• Output: the most frequent affinities among items
Recommender systems – Amazon, Netflix:
• Clickstream analysis from web usage log files
• Website visitors to page X click on links A, B, C more than on links D, E, F
In medicine:
• relationships between symptoms and illnesses;
• diagnoses and patient characteristics and treatments (to be used in medical DSS);
• genes and their functions (to be used in genomics projects).
Validation and Testing
• Frequent itemsets and high-confidence rules are found using pre-specified minimum support and minimum confidence levels.
• Measures like lift and/or leverage then ensure that interesting rules are identified rather than coincidental ones.
• Good rules provide valuable insights for institutions to improve their business operations.
Diagnostics
• Although the Apriori algorithm is easy to understand and implement, some
of the rules generated are uninteresting or practically useless.
• Measures like confidence, lift, and leverage should be used along with
human insights to address this problem.
• Another problem with association rules is that, in Phases 3 and 4 of the Data Analytics Lifecycle, the team must specify the minimum support prior to model execution, which may lead to too many or too few rules.
Approaches to improve Apriori’s efficiency:
Partitioning:
• Any itemset that is potentially frequent in the transaction database must be frequent in at least one of the partitions of the database.
Sampling:
• Extract a subset of the data with a lower support threshold and use the subset to perform association rule mining.
Transaction reduction:
• A transaction that does not contain any frequent k-itemsets is useless in subsequent scans and can therefore be ignored.