
Lecture 6:

Association Rules
Outline
Overview

Apriori Algorithm

Evaluation of Candidate Rules

Examples

Application of Association Rules

Validation and Testing

Diagnostics
Overview
• Association rules are commonly known as Market Basket Analysis.
• An unsupervised learning method.
• A descriptive, not a predictive, method.
• Used to discover interesting hidden relationships in a large dataset.
• The discovered relationships are represented as rules or frequent itemsets.
• Commonly used for mining transactions in databases.
Overview
• Some questions that association rules can answer:
• Which products tend to be purchased together?
• Of those customers who are similar to this person, what products do they tend to buy?
• Of those customers who have purchased this product, what other similar products do they tend to view or purchase?

The aim is to analyze customer buying habits by finding associations and correlations between the different items that customers place in their shopping baskets.
Overview
The general logic behind association rules:

1. Start with a large collection of transactions (depicted as three stacks of receipts, in which each transaction consists of one or more items).

2. Association rules go through the items being purchased to see which items are frequently bought together, and to discover a list of rules that describe the purchasing behavior.

3. The rules suggest, for example, that when cereal is purchased, 90% of the time milk is also purchased; when bread is purchased, 40% of the time milk is also purchased; and when milk is purchased, 23% of the time cereal is also purchased.
Overview
Rules
• Each rule is of the form X → Y.
• Meaning: when item X is observed, item Y is also observed.

Itemset
• A collection of items or individual entities among which some kind of relationship exists.
• An itemset containing k items is called a k-itemset.
• k-itemset = {item 1, item 2, …, item k}
• Examples:
• A set of retail items purchased together in one transaction.
• A set of hyperlinks clicked on by one user in a single session.
Overview
Apriori Algorithm
• One of the most fundamental algorithms for generating association rules.
• One major component of Apriori is support:
• Given an itemset L, the support of L is the percentage of transactions that contain L.
• If 80% of all transactions contain itemset {bread}, then the support of {bread} is 0.8.
• If 60% of all transactions contain itemset {bread, butter}, then the support of {bread, butter} is 0.6.

Frequent Itemset
• Items that appear together “often enough”, i.e. the itemset meets the minimum support criterion.
• If the minimum support is set at 0.7, {bread} is considered a frequent itemset, whereas {bread, butter} is not.
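A minimal Python sketch of how support can be computed from a list of transactions. The transactions below are illustrative only, chosen to reproduce the 0.8 and 0.6 figures above:

# Illustrative transactions; each transaction is a set of items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"bread", "butter", "eggs"},
    {"coke"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in `itemset`.
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

print(support({"bread"}, transactions))           # 0.8
print(support({"bread", "butter"}, transactions)) # 0.6

With a minimum support of 0.7, support({"bread"}, transactions) >= 0.7 is True while support({"bread", "butter"}, transactions) >= 0.7 is False, matching the frequent-itemset example above.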
Overview
Apriori Property
• Also called the downward closure property.
• If an itemset is frequent, then any subset of that frequent itemset must also be frequent.
• If 60% of the transactions contain {bread, jam}, then at least 60% of all the transactions will contain {bread} or {jam}.
• In other words, if the support of {bread, jam} is 0.6, the support of {bread} or {jam} is at least 0.6.
• If itemset {B, C, D} is frequent, then all the subsets of this itemset, shaded in the figure, must also be frequent itemsets.
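A quick sanity check of the downward closure property, reusing the support function and illustrative transactions sketched above:

from itertools import combinations

itemset = {"bread", "butter"}
s = support(itemset, transactions)  # 0.6 with the illustrative data above
# Every non-empty proper subset must have support at least as large.
assert all(support(set(sub), transactions) >= s
           for r in range(1, len(itemset))
           for sub in combinations(itemset, r))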
Overview
Association Rules
• An implication expression of the form X → Y, where X and Y are non-overlapping itemsets.
• E.g. {Milk, Diaper} → {Beer}
• Generating association rules:
• Step 1: Find frequent itemsets whose occurrences exceed a predefined minimum support threshold.
• Step 2: Derive association rules from those frequent itemsets (subject to a minimum confidence threshold).

Rule Evaluation Metrics
• Support (s): the number of transactions that contain both X and Y, out of the total number of transactions.
• E.g. a support of 2% means that 2% of all the transactions under analysis show that {Milk, Diaper} and {Beer} are purchased together.
• Confidence (c): the number of transactions that contain both X and Y, out of the total number of transactions that contain X.
• E.g. a confidence of 60% means that 60% of customers who purchased {Milk, Diaper} also bought {Beer}.
Outline
Overview

Apriori Algorithm

Evaluation of Candidate Rules

Examples

Application of Association Rules

Validation and Testing

Diagnostics
Apriori Algorithm
Creating Frequent Sets
• Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
• First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item and collecting those items that satisfy minimum support. The resulting set is denoted L1.
• Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found.
• Finding each Lk requires one full scan of the database.
Example 1
(5 transactions; 6 types of items; MinSupp = 3/5 = 0.6, so a frequent itemset must appear in at least 3 of the 5 transactions)

L1: Items (1-itemsets)
Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

L2: Pairs (2-itemsets); no need to generate candidates involving Coke or Eggs, since neither is frequent
Itemset          Count
{Bread,Milk}     3
{Bread,Beer}     2
{Bread,Diaper}   3
{Milk,Beer}      2
{Milk,Diaper}    3
{Beer,Diaper}    3

L3: Triplets (3-itemsets)
Itemset               Count
{Bread,Milk,Diaper}   2
{Bread,Milk,Beer}     1
{Bread,Diaper,Beer}   2

No 3-itemset meets the minimum support, so the frequent itemsets are the four frequent items and the four pairs with count 3.
Apriori Algorithm
Creating Frequent Sets

• Let’s define:
Ck as a candidate itemset of size k
Lk as a frequent itemset of size k

• Main steps of each iteration (sketched in code below):
1. Find the frequent itemset Lk-1 (starting from L1).
2. Join step: Ck is generated by joining Lk-1 with itself (Cartesian product Lk-1 × Lk-1).
3. Prune step (Apriori property): any itemset of size k−1 that is not frequent cannot be a subset of a frequent itemset of size k, so candidates containing one are removed from Ck.
4. The frequent set Lk is obtained from the surviving candidates that meet minimum support.
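A minimal Python sketch of the join and prune steps, assuming itemsets are represented as frozensets (the helper names join and prune are illustrative):

from itertools import combinations

def join(prev_frequent, k):
    # Join step: form candidate k-itemsets by unioning frequent (k-1)-itemsets.
    return {a | b for a in prev_frequent for b in prev_frequent if len(a | b) == k}

def prune(candidates, prev_frequent, k):
    # Prune step (Apriori property): drop any candidate that has an
    # infrequent (k-1)-subset.
    return {c for c in candidates
            if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))}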
Apriori Algorithm
Illustrating Apriori Principle

• Any subset of a frequent itemset must also be frequent.

• Itemsets that do not meet the minimum support threshold are pruned away.
Apriori Algorithm
Pseudo Code
L1 = {frequent 1-itemsets};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
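A runnable Python sketch of this loop, reusing the join and prune helpers sketched earlier. The transaction list is an assumption chosen to be consistent with the counts in Example 1, since the slide shows only the counts:

def apriori(transactions, min_support):
    # Return a dict mapping each frequent itemset (frozenset) to its support.
    n = len(transactions)

    def sup(itemset):
        # Fraction of transactions containing the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent 1-itemsets.
    current = {frozenset([item]) for t in transactions for item in t}
    current = {c for c in current if sup(c) >= min_support}
    frequent, k = {}, 1
    while current:
        frequent.update({c: sup(c) for c in current})
        k += 1
        candidates = prune(join(current, k), current, k)
        current = {c for c in candidates if sup(c) >= min_support}
    return frequent

# Transactions consistent with Example 1's counts (MinSupp = 0.6).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
for itemset, s in apriori(transactions, min_support=0.6).items():
    print(sorted(itemset), s)
# Frequent 1-itemsets: {Bread}, {Milk}, {Diaper} (support 0.8) and {Beer} (0.6);
# frequent 2-itemsets: {Bread,Milk}, {Bread,Diaper}, {Milk,Diaper}, {Beer,Diaper} (0.6);
# no 3-itemset survives, matching Example 1.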
Example 2

Outline
Overview

Apriori Algorithm

Evaluation of Candidate Rules

Examples

Application of Association Rules

Validation and Testing

Diagnostics
Evaluation of Candidate Rules
The process of creating association rules is two-staged.
• First, a set of candidate rules based on the frequent itemsets is generated (see the sketch after this list).
• If {Bread, Egg, Milk, Butter} is the frequent itemset, candidate rules will look like:
• {Egg, Milk, Butter} → {Bread}
• {Bread, Milk, Butter} → {Egg}
• {Bread, Egg} → {Milk, Butter}
• Etc.
• Second, the appropriateness of these candidate rules is evaluated using:
• Confidence
• Lift
• Leverage
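A minimal sketch of that first stage for a single frequent itemset: every non-empty proper subset becomes an antecedent, and the rest of the itemset becomes the consequent (the function name candidate_rules is illustrative):

from itertools import combinations

def candidate_rules(itemset):
    # Yield (antecedent, consequent) candidate rules from one frequent itemset.
    items = sorted(itemset)
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            consequent = frozenset(items) - frozenset(antecedent)
            yield frozenset(antecedent), consequent

for x, y in candidate_rules({"Bread", "Egg", "Milk", "Butter"}):
    print(sorted(x), "->", sorted(y))  # 2^4 - 2 = 14 candidate rules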
Evaluation of Candidate Rules
Confidence
• The measure of certainty or trustworthiness associated with each discovered rule.
• Mathematically, the percentage of transactions that contain both X and Y out of all the transactions that contain X:

Confidence(X → Y) = Support(X ∪ Y) / Support(X)

• E.g. if {bread, eggs, milk} has a support of 0.15 and {bread, eggs} also has a support of 0.15, the confidence of the rule {bread, eggs} → {milk} is 1.
• This means that 100% of the time a customer buys bread and eggs, milk is bought as well. The rule is therefore correct for 100% of the transactions containing bread and eggs.
Evaluation of Candidate Rules
Confidence
• A relationship may be thought of as interesting when the algorithm identifies the relationship with a measure of confidence greater than or equal to a predefined threshold (i.e. the minimum confidence).
• E.g. {Toothbrush} → {Milk}: Confidence = 10/(10+4) ≈ 0.7

• Problem with confidence:
• Given a rule X → Y, confidence considers only the antecedent (X) and the co-occurrence of X and Y.
• It cannot tell whether a rule reflects a true implication in the relationship or is purely coincidental.

https://towardsdatascience.com/association-rules-2-aa9a77241654
Evaluation of Candidate Rules
Lift
• Measures how many times more often X and Y occur together than expected if they were statistically independent of each other.
• A measure of how X and Y are really related, rather than coincidentally happening together:

Lift(X → Y) = Support(X ∪ Y) / (Support(X) ∗ Support(Y))

• Lift = 1 if X and Y are statistically independent.
• Lift > 1 indicates the degree of usefulness of the rule.
• A larger value of lift suggests a greater strength of the association between X and Y.
Evaluation of Candidate Rules
Lift
• E.g. assuming 1000 transactions:
• If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400 of the transactions, then Lift(milk → eggs) = 0.3 / (0.5 ∗ 0.4) = 1.5
• If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400 of the transactions, then Lift(milk → bread) = 0.4 / (0.5 ∗ 0.4) = 2.0
• Therefore it can be concluded that milk and bread have a stronger association than milk and eggs.
Evaluation of Candidate Rules
Leverage
• Measures the difference between the probability of X and Y appearing together in the dataset and what would be expected if X and Y were statistically independent of each other:

Leverage(X → Y) = Support(X ∪ Y) − Support(X) ∗ Support(Y)

• Leverage = 0 if X and Y are statistically independent.
• Leverage > 0 indicates the degree of relationship between X and Y.
• A larger leverage value indicates a stronger relationship between X and Y.
Evaluation of Candidate Rules
Leverage
• E.g. assuming 1000 transactions:
• If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400 of the transactions, then Leverage(milk → eggs) = 0.3 − 0.5 ∗ 0.4 = 0.1
• If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400 of the transactions, then Leverage(milk → bread) = 0.4 − 0.5 ∗ 0.4 = 0.2
• This again confirms that milk and bread have a stronger association than milk and eggs.
Summary
• Assuming 1000 transactions:
• {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400 of the transactions.
• {milk, bread} appears in 400, {milk} in 500, and {bread} in 400 of the transactions.

             {Milk} → {Eggs}            {Milk} → {Bread}
Confidence   0.3 / 0.5 = 0.6            0.4 / 0.5 = 0.8
Lift         0.3 / (0.5 × 0.4) = 1.5    0.4 / (0.5 × 0.4) = 2.0
Leverage     0.3 − 0.5 × 0.4 = 0.1      0.4 − 0.5 × 0.4 = 0.2
             Smaller                    Greater

Milk and bread have a stronger association than milk and eggs.
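A short Python sketch that reproduces the table's numbers from the stated counts (1000 transactions; the helper names are illustrative):

n = 1000  # total transactions, as stated above
count = {
    frozenset({"milk"}): 500,
    frozenset({"eggs"}): 400,
    frozenset({"bread"}): 400,
    frozenset({"milk", "eggs"}): 300,
    frozenset({"milk", "bread"}): 400,
}

def support(items):
    return count[frozenset(items)] / n

def confidence(x, y):
    return support(x | y) / support(x)

def lift(x, y):
    return support(x | y) / (support(x) * support(y))

def leverage(x, y):
    return support(x | y) - support(x) * support(y)

x = {"milk"}
for y in ({"eggs"}, {"bread"}):
    print(sorted(y),
          round(confidence(x, y), 3),  # 0.6 for eggs, 0.8 for bread
          round(lift(x, y), 3),        # 1.5 for eggs, 2.0 for bread
          round(leverage(x, y), 3))    # 0.1 for eggs, 0.2 for bread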
Evaluation of Candidate Rules
• Confidence is able to identify trustworthy rules, but it cannot tell whether a rule is coincidental.

• Measures such as lift and leverage not only ensure that interesting rules are identified but also filter out the coincidental rules.

• Together, support, confidence, lift, and leverage ensure the discovery of interesting and strong rules from a sample dataset.
Outline
Overview

Apriori Algorithm

Evaluation of Candidate Rules

Examples

Application of Association Rules

Validation and Testing

Diagnostics
Example 3
• Five market basket transactions, with items labelled as I1, I2, I3, and so on.
Example 4
• Consider a database, D, consisting of 9 transactions:

TID     List of Items
T100    I1, I2, I5
T101    I2, I4
T102    I2, I3
T103    I1, I2, I4
T104    I1, I3
T105    I2, I3
T106    I1, I3
T107    I1, I2, I3, I5
T108    I1, I2, I3

• Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).
• Let the minimum confidence required be 70%.
• We first have to find the frequent itemsets using the Apriori algorithm.
• Then, association rules are generated using the minimum support and minimum confidence (see the sketch after this list).
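Assuming the apriori and candidate_rules sketches from earlier in the lecture are in scope, the frequent itemsets and high-confidence rules for this database can be reproduced along these lines:

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

frequent = apriori(transactions, min_support=2 / 9)

def sup(itemset):
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

# Keep only rules whose confidence meets the 70% threshold.
for itemset in frequent:
    if len(itemset) > 1:
        for x, y in candidate_rules(itemset):
            confidence = sup(x | y) / sup(x)
            if confidence >= 0.7:
                print(sorted(x), "->", sorted(y), round(confidence, 2))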
Outline
Overview

Apriori Algorithm

Evaluation of Candidate Rules

Examples

Application of Association Rules

Validation and Testing

Diagnostics
Applications of Association Rules
The term market basket analysis refers to a specific implementation of association rules.
• For better merchandising – products to include in or exclude from inventory each month
• Placement of products
• Cross-selling
• Promotional programs – multiple-product purchase incentives managed through a loyalty card program
Applications of Association Rules
• Input: the simple point-of-sale transaction data
• Output: the most frequent affinities among items

• Example: according to the transaction data…
“Customers who bought a laptop computer and virus protection software also bought an extended service plan 70 percent of the time.”

• How do you use such a pattern/knowledge?
• Put the items next to each other so they are easy to find together
• Promote the items as a package (do not put one on sale if the other(s) are on sale)
• Place the items far apart from each other, so that the customer must walk the aisles to search for them and, by doing so, potentially sees and buys other items
Applications of Association Rules
Recommender systems – Amazon, Netflix:
• Clickstream analysis from web usage log files
• E.g. website visitors to page X click on links A, B, C more than on links D, E, F

In medicine:
• Relationships between symptoms and illnesses
• Diagnosis, patient characteristics, and treatments (to be used in medical DSS)
• Genes and their functions (to be used in genomics projects)
Outline
Overview

Apriori Algorithm

Evaluation of Candidate Rules

Examples

Application of Association Rules

Validation and Testing

Diagnostics
Validation and Testing
• The frequent and high-confidence itemsets are found using pre-specified minimum support and minimum confidence levels.

• Measures like lift and/or leverage then ensure that interesting rules are identified rather than coincidental ones.

• However, some of the remaining rules may be considered subjectively uninteresting because they don’t yield unexpected profitable actions.

• E.g. rules like {paper} → {pencil} are not interesting/meaningful.

• Incorporating subjective knowledge requires domain experts.

• Good rules provide valuable insights for institutions to improve their business operations.
Outline
Overview

Apriori Algorithm

Evaluation of Candidate Rules

Examples

Application of Association Rules

Validation and Testing

Diagnostics
Diagnostics
• Although the Apriori algorithm is easy to understand and implement, some of the rules generated are uninteresting or practically useless.

• Additionally, some of the rules may be generated due to coincidental relationships between the variables.

• Measures like confidence, lift, and leverage should be used along with human insight to address this problem.
Diagnostics
• Another problem with association rules is that, in Phases 3 and 4 of the Data Analytics Lifecycle, the team must specify the minimum support prior to model execution, which may lead to too many or too few rules.

• In related research, a variant of the algorithm can use a predefined target range for the number of rules, so that the algorithm can adjust the minimum support accordingly.

• The algorithm requires a scan of the entire database to obtain the result. Accordingly, as the database grows, each run takes more time to compute.
Diagnostics
Approaches to improve Apriori’s efficiency:

Partitioning:
• Any itemset that is potentially frequent in a transaction database must be frequent in at least one of the partitions of the transaction database.

Sampling:
• Extract a subset of the data with a lower support threshold and use the subset to perform association rule mining.

Transaction reduction:
• A transaction that does not contain any frequent k-itemsets is useless in subsequent scans and can therefore be ignored.

Hash-based itemset counting:
• If the corresponding hash bucket count of a k-itemset is below a certain threshold, the k-itemset cannot be frequent.

Dynamic itemset counting:
• Only add new candidate itemsets when all of their subsets are estimated to be frequent.
