
Data Mining and Analysis

Descriptive Modelling: Association Rule Analysis

Dr Daqing Chen
Outline
• What is association analysis (market basket analysis)?
• Key concepts and terminologies:
– Itemset
– k-itemset
– Support count and support of an itemset
– Frequent itemset (large itemset)
– Support, confidence, and lift of an association rule
• Apriori algorithm:
– How it works
– How to use it to generate frequent itemsets and further generate association rules
• Implementation in Python:
– PyCaret, more powerful
– Apyori, simple



Association Rule Mining: Basic Concepts
• Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items in
the transaction
• Also known as market basket analysis or affinity analysis; can be used
for cross-selling, planning store layout, recommendations, etc.
• More generally, find associations or links between different
attributes or attribute-value pairs
[Table: market basket transactions]

Example of Association Rules

{Diapers} → {Beer},
{Milk, Bread} → {Eggs, Coke},
{Beer, Bread} → {Milk}, ……

Implication: "→" means co-occurrence/correlation, not causality!
Association Rule Mining: Basic Concepts
• Item: A distinct object or a unique attribute-value pair (Recall:
data matrix for transaction records, Item=Beer, Item=Diapers, …)
• Itemset: A collection of one or more items
• k-itemset: An itemset that contains k items
• Support count of an itemset: The frequency of occurrence of an
itemset in a dataset – simply COUNT!
• Support: The fraction of transactions in a dataset that contain a
certain itemset
• Frequent itemset: An itemset whose support is not less than a
pre-defined minimum support threshold; also called a large
itemset



Basic Concepts: An Example
• Distinct items: Bread, Milk, Diapers, Beer, Eggs, Coke
• Itemsets:
– 1-itemsets:
{Bread}, {Milk}, {Diapers}, {Beer}, {Eggs}, {Coke} (each distinct item is a 1-itemset)
– 2-itemsets:
{Bread, Milk}, {Bread, Diapers}, {Bread, Beer}, {Bread, Eggs}, {Bread, Coke}, {Milk,
Diapers}, {Milk, Beer}, {Milk, Eggs}, {Milk, Coke}, … (all possible combinations of any
two of the six distinct items)
– 3-itemsets: {Bread, Milk, Diapers} …
• What is the max size of the itemsets that can be extracted from the dataset,
i.e., the max number of items in an itemset?
– The 6-itemset {Bread, Milk, Diapers, Beer, Eggs, Coke}
• How many itemsets in total can be created? 2^n (including the empty set; here 2^6 = 64)
• Support count({Bread, Milk, Diapers}) = ? 2
• Support({Bread, Milk, Diapers}) = ? 2/5 = 0.4 = 40%
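To make the counting concrete, here is a minimal Python sketch. The five transactions are an assumption (the transaction table is not preserved above); they are the classic market-basket example whose counts match this slide's answers.

```python
# Minimal sketch: counting support for an itemset.
# The five transactions below are an assumption, chosen so the counts
# match this slide (support count 2, support 2/5 for {Bread, Milk, Diapers}).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

itemset = {"Bread", "Milk", "Diapers"}
print(support_count(itemset, transactions))  # 2
print(support(itemset, transactions))        # 0.4
```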
Rule Evaluation Metrics
• Association rule: Let X and Y denote two disjoint
itemsets, X ∩ Y = ∅; an association rule is an implication
expression of the form X → Y,
e.g.: {Diapers} → {Beer}, or
{Milk, Bread} → {Eggs, Coke}, ....
• An association rule indicates: IF someone buys X, THEN
s/he is likely to buy Y, too.
How to indicate the likelihood?
Which rules are interesting and useful?
How to measure them?
Rule Evaluation Metrics
Each rule has two basic measures
• Support:
– Defined as the ratio of the number of transactions that contain
both X and Y to the total number of transactions in a given dataset
– Represents how frequently the itemsets (X and Y) appear in a given
dataset, i.e., the frequency of the occurring pattern
• Confidence:
– Defined as the ratio of the number of transactions that contain
both X and Y to the number of transactions that contain X
– Indicates the strength of implication in the rule, i.e., how often the
rule has been found to be true
– Represents the conditional probability that Y is true when X is known
to be true
Rule Evaluation Metrics
– S = Support(X → Y) = support count(X ∪ Y) / total number of transactions
– C = Confidence(X → Y) = support(X ∪ Y) / support(X)
                        = support count(X ∪ Y) / support count(X)
                        = Pr(Y | X)
(support count(X ∪ Y) counts the transactions containing every item in
both X and Y; C is a conditional probability of Y given X)

We are only interested in strong rules, which satisfy both the minimum
support threshold and the minimum confidence threshold
Association Rules: An Example
Example of association rules from this
transaction dataset:
{Milk, Diapers} → {Beer}: s=2/5=0.4; c=2/3=0.67
{Milk, Beer} → {Diapers}: s=2/5=0.4; c=2/2=1.00

• The antecedent and consequent of a rule
– IF (certain specified patterns occur in the data)
– THEN (take the appropriate actions)
– The left-hand side of the rule (LHS), or the IF part, is known
technically as the antecedent of the rule
– The right-hand side (RHS), or the THEN part, is called the
consequent
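A minimal sketch of the two metrics on the same assumed five transactions, reproducing the numbers above:

```python
# Minimal sketch: support and confidence of a rule X -> Y, on the
# assumed five-transaction example (repeated so the block runs on its own).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]

def count(itemset):
    return sum(itemset <= t for t in transactions)

def rule_support(X, Y):
    return count(X | Y) / len(transactions)

def rule_confidence(X, Y):
    return count(X | Y) / count(X)

# {Milk, Diapers} -> {Beer}: s = 2/5 = 0.4, c = 2/3 ≈ 0.67
print(rule_support({"Milk", "Diapers"}, {"Beer"}))
print(rule_confidence({"Milk", "Diapers"}, {"Beer"}))
# {Milk, Beer} -> {Diapers}: s = 2/5 = 0.4, c = 2/2 = 1.00
print(rule_confidence({"Milk", "Beer"}, {"Diapers"}))
```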
Association Rule Mining Task
• Given a set of transactions, the goal of
association rule mining is to find all rules having
– Support ≥ minimum support threshold
– Confidence ≥ minimum confidence threshold
• Two-stage approach:
– Frequent Itemset Generation: Generate all itemsets
whose support ≥ minimum support threshold
– Rule Generation: Generate high-confidence rules from
each frequent itemset, i.e., rules with confidence ≥
minimum confidence threshold



Approaches to Association Rule Mining
• Brute-force (naïve) approach:
– List all possible itemsets
– List all possible association rules
– Calculate the support and confidence for each rule
– Prune rules that fail the minimum support and
minimum confidence thresholds
However, this is computationally prohibitive

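A brute-force sketch using itertools, to show why the naïve approach explodes: it materialises every itemset and every rule, which is manageable for six items but hopeless for a real product catalogue.

```python
# Brute-force sketch: enumerate every itemset and every candidate rule.
from itertools import combinations

items = ["Bread", "Milk", "Diapers", "Beer", "Eggs", "Coke"]

# All non-empty itemsets: 2^6 - 1 = 63
itemsets = [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k)]
print(len(itemsets))  # 63

# All candidate rules X -> Y with X, Y non-empty and disjoint,
# drawn from each itemset of size >= 2.
rules = [(frozenset(X), L - frozenset(X))
         for L in itemsets if len(L) >= 2
         for k in range(1, len(L))
         for X in combinations(L, k)]
print(len(rules))  # 602 for d = 6 items
```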


Computational Complexity
• Given d distinct items:
– Total number of itemsets = 2^d
– Total number of possible association rules increases
exponentially as d increases
– Think about how many distinct items a retailer like Tesco
offers, and how many itemsets would have to be checked:

d      2^d
5      32
10     1,024
20     1,048,576
40     ≈1.1 × 10^12

If d = 6, R = 602 rules

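The 602 figure follows from the standard closed form for the number of possible rules over d items: each item can go to the antecedent, the consequent, or neither (3^d assignments), minus the assignments that leave either side empty. A quick check:

```python
# Sketch: the standard closed-form count of candidate rules over d items,
# R = 3^d - 2^(d+1) + 1 (3^d ternary assignments, minus 2^d with an empty
# LHS, minus 2^d with an empty RHS, plus 1 for double-counting both empty).
for d in (5, 6, 10, 20):
    print(d, 3**d - 2**(d + 1) + 1)
# d = 6 gives 602, matching the slide
```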


Computational Complexity
• Itemset lattice: A structure showing all the possible
itemsets, lexicographically ordered, that can be
generated from a given number of distinct items
• Do we have to search and check all the itemsets one by
one?



Generating Frequent Itemsets Efficiently
• The brute-force approach is too expensive and not practical …
How can we find all frequent itemsets efficiently?
• Apriori approach: popular and effective
We know that finding frequent 1-itemsets is easy, so …
• Idea: only use frequent itemsets to generate bigger
itemsets, and ignore any infrequent itemset
• Start with frequent 1-itemsets to generate 2-itemsets,
and use frequent 2-itemsets and 1-itemsets to generate
3-itemsets, and so on ...
• Is this approach valid?
Generating Frequent Itemsets Efficiently
• Apriori principle:
– If an itemset is frequent, then all of its subsets are also frequent,
i.e., if {A, B} is a frequent itemset, then {A} and {B} are frequent itemsets as well
– In general, if X is a frequent k-itemset, then all (k−1)-item subsets
of X are also frequent
• The Apriori principle holds due to the following property of the
support measure:
∀ X, Y: X ⊆ Y ⇒ s(X) ≥ s(Y)
– The support of an itemset never exceeds the support of any of its
subsets
– This is known as the anti-monotone property of support
Discussion on Apriori Principle
• Known: ∀ X, Y: if X ⊆ Y, then s(X) ≥ s(Y)
• Consider the relationship between a subset and
its superset (and vice versa) in terms of support

Scenario: minS > s(X) ≥ s(Y)
– If a subset is infrequent, then any of its supersets
is infrequent
– Equivalently, if a superset is frequent, then any of
its subsets is frequent
Scenario: minS ≤ s(X), with s(X) ≥ s(Y)
– If a subset is frequent, then any of its supersets
may or may not be frequent
– If a superset is infrequent, then any of its
subsets may or may not be frequent
Illustrating Apriori Principle
If itemset CDE is frequent, then any subset of CDE
({C}, {D}, {E}, {C, D}, {C, E}, {D, E}) is frequent

[Figure: itemset lattice over items A–E, from the null itemset down to
ABCDE, with the frequent itemset CDE and all of its subsets highlighted]
Illustrating Apriori Principle
If itemset AB is infrequent, then any itemset containing
AB is infrequent

[Figure: the same itemset lattice, with the infrequent itemset AB
marked and all of its supersets pruned]
Apriori Algorithm: Reducing the Number of
Candidate Itemsets
Given d distinct items and a pre-defined minimum
support threshold:
Apriori Algorithm for Searching and Generating Frequent Itemsets
1: Set k = 1
2: Repeat
List all candidate k-itemsets.
Count the support for each candidate itemset. Select only the frequent k-itemsets that
satisfy the predefined minimum support threshold. Ignore any infrequent
itemsets.
Use the remaining frequent k-itemsets to generate candidate (k+1)-itemsets.
k = k + 1
3: Until k = d
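A minimal sketch of the algorithm above (an assumed implementation, not the lecture's code): each level keeps only the frequent k-itemsets, and a candidate (k+1)-itemset survives only if all of its k-item subsets are frequent.

```python
# Minimal Apriori sketch: level-wise search that only extends frequent
# itemsets, with subset-based candidate pruning.
from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent 1-itemsets
    level = {frozenset([i]) for t in transactions for i in t}
    level = {s for s in level if support(s) >= min_support}
    frequent = {}
    while level:
        frequent.update({s: support(s) for s in level})
        k = len(next(iter(level)))
        # Candidate (k+1)-itemsets: unions of two frequent k-itemsets ...
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # ... kept only if every k-item subset is frequent (Apriori pruning)
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k))}
        level = {c for c in candidates if support(c) >= min_support}
    return frequent

transactions = [  # the assumed five-transaction example again
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]
for itemset, s in sorted(apriori_frequent_itemsets(transactions, 0.4).items(),
                         key=lambda kv: (len(kv[0]), -kv[1])):
    print(set(itemset), s)
```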


Rule Generation
• Given a frequent itemset L, find all non-empty subsets F ⊂ L such
that F → L − F satisfies the minimum confidence requirement,
i.e., simply split a frequent itemset into two parts, one as the
antecedent and the remainder as the consequent, to form different rules
– Example: If {A, B, C, D} is a frequent itemset, candidate rules:
ABC → D, ABD → C, ACD → B, BCD → A,
AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB,
A → BCD, B → ACD, C → ABD, D → ABC
– Note: All of these rules have the same support
• In general, if L contains k items, then there are 2^k − 2 candidate
association rules (ignoring L → ∅ and ∅ → L)
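A small sketch of the splitting step: enumerate every non-empty proper subset of L as an antecedent, with the remaining items as the consequent.

```python
# Sketch: enumerate all 2^k - 2 candidate rules from one frequent itemset.
from itertools import combinations

def candidate_rules(L):
    """All splits of frequent itemset L into antecedent X and consequent L - X."""
    L = frozenset(L)
    for k in range(1, len(L)):          # skip empty antecedent/consequent
        for X in combinations(L, k):
            yield frozenset(X), L - frozenset(X)

rules = list(candidate_rules({"A", "B", "C", "D"}))
print(len(rules))  # 2**4 - 2 = 14
```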


Rule Generation
• How can we efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property, e.g.,
c(ABC → D) can be larger or smaller than c(AB → D)
– But the confidence of rules generated from the same frequent itemset
does have an anti-monotone property with regard to the number of items
on the RHS of the rule
– e.g., for L = {A, B, C, D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– How to prove this?
c(ABC → D) = s(ABCD)/s(ABC), c(AB → CD) = s(ABCD)/s(AB), c(A → BCD) = s(ABCD)/s(A)
We know s(ABC) ≤ s(AB) ≤ s(A), so all three rules share the same numerator,
and c(ABC → D) has the smallest denominator, hence the biggest value
Rule Generation
• In other words, if c(BCD → A) is low, then any rule containing
A in its consequent (RHS) will have low confidence, e.g.,
CD → AB, BD → AC, BC → AD,
D → ABC, C → ABD, B → ACD
• Important fact: For a given association rule, moving
items from the antecedent to the consequent never
changes support, and never increases confidence


Rule Generation Using Apriori Algorithm
If the confidence for {BCD} → {A} is low, then all the rules
containing item A in their consequent can be disregarded

[Figure: lattice of rules generated from {A, B, C, D}, with the
low-confidence rule BCD → A and all rules below it pruned]
Discussion
• Data format: binary, nominal
• Attribute-value pairs and transactions: data matrix
• Support count - essential
• Confidence is not necessarily the best measure; other measures have
been devised, e.g., lift (correlation):

Lift = Conf(X → Y) / sup(Y) = sup(X ∪ Y) / (sup(X) · sup(Y)) = Pr(X ∪ Y) / (Pr(X) · Pr(Y))

• Lift = 1: X and Y are independent, and the items are randomly purchased together
• Lift < 1: negatively associated – the occurrence of X inhibits the occurrence of Y
• Lift > 1: positively associated – the occurrence of X prompts the occurrence of Y,
and the items are purchased together more often than at random



Discussion
• Suppose a transaction dataset contains milk and bread as frequent
itemsets (out of 2000 transactions on a given day):
– Set min support = 40%
– Set min confidence = 70%

             milk    not milk   Total
bread         900       750     1650
not bread     300        50      350
Total        1200       800     2000

S(bread) = 1650/2000 = 82.5%, S(milk) = 1200/2000 = 60%
S(milk, bread) = 900/2000 = 45%
C(milk → bread) = 900/1200 = 75%, C(bread → milk) = 900/1650 ≈ 54%
Lift = 0.45/(0.6 × 0.825) = 0.91 < 1
Negatively associated: buying one item results in a decrease in buying the
other item
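A quick check of these numbers in Python, reading the counts straight from the contingency table above:

```python
# Sketch: support, confidence, and lift from the contingency table above.
n = 2000
n_bread, n_milk, n_both = 1650, 1200, 900

s_bread = n_bread / n            # 0.825
s_milk = n_milk / n              # 0.60
s_both = n_both / n              # 0.45

conf_milk_to_bread = n_both / n_milk    # 0.75
lift = s_both / (s_milk * s_bread)      # ≈ 0.91 < 1: negatively associated
print(conf_milk_to_bread, lift)
```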
Discussion
• How to set an appropriate minimum support
threshold?
– If it is set too high, we could miss itemsets involving
interesting rare items (e.g., expensive products in transaction
records; unit failures in student records)
– If it is set too low, mining becomes computationally expensive, and the
number of itemsets to create is very large
• Using a single minimum support threshold may not be
effective
• Using the support count or support of each distinct
item as a reference
Using the Support Count or Support of Each
Distinct Item as a Reference
• What would be an appropriate min support threshold in order to find
any association rules relating to item C?
• What would be an appropriate min support threshold in order to find
any association rules relating to item E?



Implement Apriori in Python
• PyCaret: more powerful

• Apyori: simple; works in both Jupyter Notebook and JupyterLab


Use apriori: An Example

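The original slide shows a code screenshot that is not preserved here. Below is a minimal sketch of typical Apyori usage, on the assumed five-transaction example; the field names follow apyori's relation records (items, support, ordered_statistics).

```python
# Minimal apyori sketch (assumed data; install with: pip install apyori).
from apyori import apriori

transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diapers", "Beer", "Eggs"],
    ["Milk", "Diapers", "Beer", "Coke"],
    ["Bread", "Milk", "Diapers", "Beer"],
    ["Bread", "Milk", "Diapers", "Coke"],
]

results = list(apriori(transactions, min_support=0.4, min_confidence=0.6))
for record in results:
    for stat in record.ordered_statistics:
        print(set(stat.items_base), "->", set(stat.items_add),
              f"support={record.support:.2f}",
              f"confidence={stat.confidence:.2f}",
              f"lift={stat.lift:.2f}")
```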


Use PyCaret: An Example
• Use InvoiceNo along with Description (or
StockCode) for association rule analysis
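A hedged sketch of this workflow, assuming PyCaret 2.x (whose pycaret.arules module provides setup/create_model; it was removed in PyCaret 3); the file name and data are assumptions based on the slide's InvoiceNo/Description columns.

```python
# Sketch assuming PyCaret 2.x's association-rules module (pycaret.arules)
# and a transactional DataFrame with InvoiceNo/Description columns.
import pandas as pd
from pycaret.arules import setup, create_model

data = pd.read_csv("online_retail.csv")  # assumed file name

# Group rows into transactions by InvoiceNo, mine rules over Description
exp = setup(data=data, transaction_id="InvoiceNo", item_id="Description")
rules = create_model(metric="confidence", threshold=0.5, min_support=0.05)
print(rules.head())  # antecedents, consequents, support, confidence, lift
```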


Use PyCaret: An Example (Cont’d)
• Visualise results:

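Continuing the PyCaret 2.x sketch, plot_model renders interactive rule plots; the plot names below are the module's assumed options.

```python
# Sketch, continuing the assumed PyCaret 2.x example above.
from pycaret.arules import plot_model

plot_model(rules, plot="2d")  # support vs confidence scatter (assumed API)
plot_model(rules, plot="3d")  # support, confidence, and lift (assumed API)
```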


Summary
• Association analysis: the basic concepts
• Frequent itemsets, strong rules, support count,
the support and confidence of a rule
• Generating frequent itemsets and strong rules:
– Brute-force approach
– Apriori approach
• Other measures: lift
• How to determine a proper threshold

