
Prof. Heitor Silvério Lopes
Prof. Thiago H. Silva

Data Mining & Knowledge Discovery
Class 4 – Associative Analysis: frequent and infrequent pattern discovery
2025
Tasks x Methods in Data Mining
● Classification: Decision trees (C4.5), classification rules, k-nearest neighbors, random forest, support vector machine, Bayesian classifier, neural network, AdaBoost
● Association rules: Apriori, FP-growth, Eclat, ZigZag
● Regression: Linear regression, polynomial regression, logistic regression
● Feature selection & dimensionality reduction: Principal component analysis (PCA), chi-square, entropy, information gain
● Clustering: K-means, Kohonen's self-organizing map, density-based scan, hierarchical grouping, t-SNE
● Data visualization*: Silhouette plot, scatter plot, heatmap, box plot, clusters, t-SNE
Where do impulse purchases happen most? Supermarkets or shopping malls?

● 34% in supermarkets
● 25% in shopping malls
● 19% in online e-commerce
Why impulse purchases?
● People buy on impulse because of:
○ Emotional reasons (retail therapy)
○ Lack of economic education
○ The belief that it is a deal
○ The dopamine effect caused by shopping
What encourages someone to buy something?
RECOMMENDATION
ASSOCIATION

Two important questions for retail companies
● How to discover the interests of people who browse the internet, social networks, shopping malls, and supermarkets?
● How to associate personal interests with products and/or services in order to encourage consumption?

● Answer: Association Rule Discovery!!!
● Why? It discovers non-obvious relationships between items in a dataset
Application areas of Association Rules
● Retail sales → Products purchased together
● Recommendation systems → Find products based on shared interests with other people
● Control and supervision systems → Discover relationships between events and failures
● Bioinformatics → Discover unknown interactions between genes and diseases
● Medical diagnosis → Discover unknown cause-effect relationships of drugs
Some difficulties found when discovering Association Rules
● A large amount of (sparse) data is needed to obtain any useful knowledge
● The algorithms used are computationally (very) expensive (it is a many-to-many mapping!)
● Some relationships found may happen by chance
● There is no "cause → effect" proof
Association rules - Objective
Given a set of transactions, find rules that predict the occurrence of an item based on the occurrence of other items in the set.

Shopping basket transactions:
customer | purchases
William | Bread, Milk
Gabriel | Bread, Diapers, Beer, Eggs
Mary | Milk, Diapers, Beer, Coke
Thiago | Bread, Milk, Diapers, Beer
Sophia | Bread, Milk, Diapers, Coke

Some association rules (from frequent itemsets):
{Bread} → {Milk}
{Beer} → {Diapers}
{Milk} → {Coke}

Implication means co-occurrence, NOT causality!
Very important definitions
(Using the shopping-basket transactions shown above.)
● Itemset
○ A collection of zero or more items, e.g., {Milk, Bread, Diapers}
○ k-itemset: a set containing k items
● Support count (σ)
○ Frequency of occurrence of an itemset
○ E.g., σ({Eggs}) = 1 and σ({Milk, Diapers}) = 3
● Frequent itemset
○ An itemset whose support is greater than or equal to a given threshold minsup
○ E.g., considering minsup = 4 (as a count), {Bread}, {Milk}, and {Diapers} are frequent itemsets, but NOT {Milk, Diapers}!
● Association rule
○ An expression with an implication between two itemsets, written X → Y. Example: {Milk, Diapers} → {Beer}
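These definitions map directly to code. Below is a minimal Python sketch with the transactions table above hard-coded; the helper name support_count is my own, not from the slides:

```python
# The five shopping-basket transactions from the slide.
transactions = [
    {"Bread", "Milk"},                        # William
    {"Bread", "Diapers", "Beer", "Eggs"},     # Gabriel
    {"Milk", "Diapers", "Beer", "Coke"},      # Mary
    {"Bread", "Milk", "Diapers", "Beer"},     # Thiago
    {"Bread", "Milk", "Diapers", "Coke"},     # Sophia
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if itemset <= t)

print(support_count({"Eggs"}, transactions))             # 1
print(support_count({"Milk", "Diapers"}, transactions))  # 3
```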
Association rules evaluation
(Using the same transactions.) Consider this association rule: {Milk, Diapers} → {Beer}

● Support (s)
○ The fraction of transactions that contain both X and Y:
s = σ(Milk, Diapers, Beer) / N = 2/5 = 0.4
● Confidence (c)
○ Measures how frequently the items of Y appear in the transactions that contain X:
c = σ(Milk, Diapers, Beer) / σ(Milk, Diapers) = 2/3 ≈ 0.67
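Continuing the sketch (reusing transactions and support_count from above), the two rule metrics for {Milk, Diapers} → {Beer} can be computed as:

```python
def support(X, Y, transactions):
    """s(X -> Y): fraction of transactions containing X union Y."""
    return support_count(X | Y, transactions) / len(transactions)

def confidence(X, Y, transactions):
    """c(X -> Y): how often Y appears in transactions that contain X."""
    return support_count(X | Y, transactions) / support_count(X, transactions)

X, Y = {"Milk", "Diapers"}, {"Beer"}
print(support(X, Y, transactions))     # 2/5 = 0.4
print(confidence(X, Y, transactions))  # 2/3 ≈ 0.67
```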
How to interpret support and confidence
● Support (s):
○ A rule with low support may hold merely by chance, since its items rarely occur together. E.g., {Diapers, Beer} → {Coke}, s = 1/5
● Confidence (c):
○ Measures the reliability of the inference made by a rule
○ High confidence means that Y is likely to be present in transactions that contain X
○ Rules with high confidence but very low support are generally not of much interest. E.g., {Eggs} → {Beer}
Mining association rules
● Given a set T of transactions, the task of mining association rules consists of finding all rules that satisfy these requirements:
○ support ≥ minsup
○ confidence ≥ minconf

● Brute-force approach:
○ List all possible association rules
○ Compute support and confidence for every rule
○ Discard rules that do not satisfy the minsup and minconf thresholds
○ Such an approach is computationally prohibitive for very large itemsets
A two-step approach for mining association rules
1. Generation of frequent itemsets:
○ Find all itemsets that satisfy support ≥ minsup
2. Generation of rules:
○ For each frequent itemset, generate rules with high confidence
○ Each rule is a binary partition of that itemset

● Recall that the generation of frequent itemsets is still computationally costly
1: Generation of frequent itemsets
(Same transactions as before.) Examples of rules and their metrics:
{Milk, Diapers} → {Beer} (s = 2/5 = 0.4, c = 2/3 ≈ 0.67)
{Milk, Beer} → {Diapers} (s = 0.4, c = 1.0)
{Diapers, Beer} → {Milk} (s = 0.4, c = 0.67)
{Beer} → {Milk, Diapers} (s = 0.4, c = 0.67)
{Diapers} → {Milk, Beer} (s = 0.4, c = 0.5)
{Milk} → {Diapers, Beer} (s = 0.4, c = 0.5)

● All the above rules are binary partitions of the same itemset, {Milk, Diapers, Beer} (enumerated in the sketch below)
● Rules originating from the same itemset have identical support but different confidence
● Therefore, if the set {Milk, Diapers, Beer} is infrequent, all 6 rules generated from it will also be infrequent
● Thus, the support and confidence requirements can be considered separately
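A small sketch of how the six binary partitions can be enumerated (the helper name binary_partitions is my own):

```python
from itertools import combinations

def binary_partitions(itemset):
    """All rules X -> Y where X and Y are non-empty and X union Y = itemset."""
    items = sorted(itemset)
    for r in range(1, len(items)):
        for antecedent in combinations(items, r):
            X = set(antecedent)
            yield X, set(itemset) - X

for X, Y in binary_partitions({"Milk", "Diapers", "Beer"}):
    print(sorted(X), "->", sorted(Y))
# 6 rules, all with the same support: that of {Milk, Diapers, Beer}
```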
1: Generation of frequent itemsets
● Given d items, there are 2^d possible itemsets
● Example: d = 5 → 32 itemsets
● Example: d = 100 → 2^100 = 1,267,650,600,228,229,401,496,703,205,376 itemsets (about 1.27 × 10^30)
1: Generation of frequent itemsets
● Brute-force approach:
○ Each itemset in the lattice is a candidate frequent itemset
○ Compute the support of each candidate by examining the dataset

List of candidate itemsets x transactions:
customer | purchases
William | Bread, Milk
Gabriel | Bread, Diapers, Beer, Eggs
Mary | Milk, Diapers, Beer, Coke
Thiago | Bread, Milk, Diapers, Beer
Sophia | Bread, Milk, Diapers, Coke

○ Compare each transaction against all candidates
■ Computational complexity: ~O(N·M·W), for N transactions, M candidates, and maximum transaction width W
■ Still very hard, since M = 2^d
1: Generation of sets of frequent itemsets
● Given d (unique!) items:
○ The total number of possible itemsets is 2^d
○ The number of possible association rules is:

R = Σ_{k=1}^{d−1} [ C(d, k) · Σ_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

○ For d = 6: {Milk, Bread, Diapers, Beer, Eggs, Coke}

d | #itemsets | #rules
1 | 2 | 0
2 | 4 | 2
3 | 8 | 12
4 | 16 | 50
5 | 32 | 180
6 | 64 | 602
10000 | ≈1.995 × 10^3010 | ≈1.63 × 10^4771
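As a sanity check, the closed form 3^d − 2^(d+1) + 1 can be verified by direct enumeration for small d; a small sketch:

```python
from math import comb

def count_rules(d):
    """Brute-force count of rules X -> Y over d items (X, Y disjoint, non-empty)."""
    return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
               for k in range(1, d))

for d in range(1, 7):
    assert count_rules(d) == 3**d - 2**(d + 1) + 1
    print(d, 2**d, count_rules(d))   # d, #itemsets, #rules
```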
1: How to reduce the number of frequent itemsets?
● Reduce the number of candidates (M):
○ A full search has M = 2^d
○ Use pruning techniques to reduce M
● Reduce the number of transactions (N):
○ Reduce N as the itemset size increases
● Reduce the number of comparisons (N·M):
○ Use an efficient data structure to store the set of candidates or the set of transactions
1: The Apriori principle

If an itemset is frequent, then all its subsets must also be frequent.

● The Apriori principle is supported by the anti-monotonicity property of the support metric, which establishes that:
○ The support of an itemset is never greater than the support of its subsets. That is:

∀ X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)
1: Illustration of the Apriori principle
(Lattice figure: frequent itemsets imply frequent subsets)

1: Illustration of the Apriori principle
(Lattice figure: a set of infrequent items triggers support-based pruning of its supersets, which are pruned without being counted)
1: Example of itemset generation using Apriori
minsup = 0.60 → minimum count = 3 (out of N = 5 transactions)

Tid | purchases
1 | Bread, Milk
2 | Bread, Diapers, Beer, Eggs
3 | Milk, Diapers, Beer, Coke
4 | Bread, Milk, Diapers, Beer
5 | Bread, Milk, Diapers, Coke

Candidate 1-itemsets, C(6,1) = 6:
Item | Count
Bread | 4
Coke | 2
Milk | 4
Beer | 3
Diapers | 4
Eggs | 1

Candidate 2-itemsets over the 4 frequent items, C(4,2) = 6:
Itemset | Count
{Bread, Milk} | 3
{Bread, Beer} | 2
{Bread, Diapers} | 3
{Milk, Beer} | 2
{Milk, Diapers} | 3
{Beer, Diapers} | 3

Candidate 3-itemsets, C(4,3) = 4:
Itemset | Count
{Bread, Milk, Diapers} | 2
{Bread, Milk, Beer} | 1
{Milk, Diapers, Beer} | 2
{Bread, Diapers, Beer} | 2

If all the subsets were enumerated: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
By using pruning based on minimum support: 6 + 6 + 4 = 16 candidates.
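A compact level-wise Apriori sketch, reusing transactions and support_count from the earlier snippets. Note that with the full subset-pruning step only {Bread, Milk, Diapers} survives as a 3-itemset candidate (the slide enumerates all four triples over the frequent items), and it then fails the minimum count:

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise frequent-itemset mining (candidate generation + pruning)."""
    # Level 1: frequent individual items.
    items = {i for t in transactions for i in t}
    frequent = [{frozenset([i]) for i in items
                 if support_count({i}, transactions) >= min_count}]
    k = 2
    while frequent[-1]:
        prev = frequent[-1]
        # Join step: merge (k-1)-itemsets that differ by exactly one item.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori principle): every (k-1)-subset must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        frequent.append({c for c in candidates
                         if support_count(c, transactions) >= min_count})
        k += 1
    return [level for level in frequent if level]

for level in apriori(transactions, min_count=3):
    print([sorted(s) for s in level])
```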
1: How to adjust the minimum support?
● If minsup is set too high, itemsets involving rare items may be missed
○ E.g., in purchase data, very expensive products are bought infrequently
● If minsup is set too low, the search for association rules becomes computationally expensive and the number of itemsets may be very large
○ It should be considered that real datasets are high-dimensional (many transactions and many items)
● There is no way around it: the minimum support must be treated as a variable, and it is up to the user to adjust it dynamically according to:
○ Problem requirements
○ Computational power
○ Results obtained so far
2: Generation and pruning of candidate rules
● Generation of candidates:
○ Generate new candidate k-itemsets from the frequent (k−1)-itemsets of the previous iteration
● Candidate pruning using the Apriori principle:
○ Eliminate some of the candidate k-itemsets
○ Every (k−1)-subset of a candidate must be frequent; otherwise, the candidate is infrequent
○ Example (see the rule-generation sketch below):
■ If {a,b,c} is frequent, then {a,b}, {b,c}, and {a,c} must be frequent
■ If any of them is not frequent, then {a,b,c} is not frequent
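Putting the two steps together: a sketch of rule generation with confidence pruning, reusing apriori, binary_partitions, and confidence from the earlier snippets (the helper name generate_rules and the 0.7 threshold are my own choices for illustration):

```python
def generate_rules(frequent_itemsets, transactions, min_conf):
    """Step 2: keep only the binary partitions X -> Y with enough confidence."""
    rules = []
    for itemset in frequent_itemsets:
        for X, Y in binary_partitions(itemset):
            c = confidence(X, Y, transactions)
            if c >= min_conf:
                rules.append((X, Y, c))
    return rules

# Frequent itemsets of size >= 2 found by the Apriori sketch above.
freq = [s for level in apriori(transactions, min_count=3) for s in level
        if len(s) >= 2]
for X, Y, c in generate_rules(freq, transactions, min_conf=0.7):
    print(sorted(X), "->", sorted(Y), f"(c = {c:.2f})")
```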
2: Generation and pruning of candidate rules
● Brute-force method:
○ Generate all combinations and prune candidates without minimum support (minsup)
○ As mentioned before, the computational complexity for each candidate at level k is O(k), and for the entire method it is O(d·2^(d−1))
2: Consequences of the computational complexity
● Minimum support value:
○ Decreasing the minimum support threshold yields more frequent itemsets
○ This can increase the number of candidates and the maximum length (W) of frequent itemsets
● Dimensionality (number of items) of the dataset:
○ More memory is required to store the support counters for each item
○ If the number of frequent items also increases, both the processing cost and the I/O cost increase
● Size of the transaction database:
○ Since the Apriori algorithm traverses the database multiple times, the processing time of the algorithm increases with the number of transactions
● Average width of transactions:
○ The average width increases with denser datasets
○ This can increase W, and the number of subsets in a transaction increases with its maximum width
Case study #1
● Dataset: Foodmart 2000
● What is the relationship between purchases of lightbulbs and batteries?
● Step 1: Generation of frequent itemsets (that contain lightbulbs)

Case study #1
● Association rules that must include batteries in the antecedent
Case study #2
● Titanic dataset
● Apriori with minsup = 10% and minconf = 90%
● Frequent itemsets:

Case study #2
● Titanic dataset
● Association rules found:
How to evaluate association rules?
● Association rule mining algorithms tend to produce a large number of rules for different values of support and confidence
● However, many of these rules are redundant or uninteresting
○ A rule such as {A,B,C} → {D} is redundant if {A,B} → {D} holds with the same support and confidence
○ (Quantitatively) uninteresting patterns:
■ Those involving a set of mutually independent items (the occurrence of one event does not affect the probability of the other)
■ Those covering very few transactions
○ (Qualitatively) uninteresting patterns:
■ Those that are obvious or expected (by most people), e.g., {Bread} → {Milk}
● Measures of interest can be used for pruning or ranking the association rules obtained by an algorithm
How to evaluate association rules?
● Objective measures:
○ The association rules are ranked by statistical measures computed over the data
○ An association metric is used, such as support, confidence, Laplace, Gini, mutual information, the Jaccard index, etc.
● Subjective measures:
○ The association rules are ranked according to the user's interpretation
○ E.g., a rule is subjectively interesting when it contradicts the user's expectations

A. Silberschatz & A. Tuzhilin, On subjective measures of interestingness in knowledge discovery. Proc. Knowledge Discovery in Databases conference, pp. 275-281, 1995.
Objective metrics for the evaluation of association rules
(Same transactions as before.) Consider the rule {Milk, Diapers} → {Beer}.

● Support (s):
○ Fraction of transactions that contain both X and Y:
s = σ(Milk, Diapers, Beer) / N = 2/5 = 0.4
● Confidence (c):
○ Measures how frequently the items of Y appear in transactions that contain X:
c = σ(Milk, Diapers, Beer) / σ(Milk, Diapers) = 2/3 ≈ 0.67
● Lift:
○ Confidence divided by the proportion of instances covered by the consequent
○ Measures the importance of the association independently of support:
lift = c / s(Beer) = 0.67 / 0.6 ≈ 1.11
Objective metrics for the evaluation of association rules
● Leverage (L):
○ The proportion of additional instances covered by both antecedent and consequent beyond what would be expected if the two were statistically independent:
L(X → Y) = s(X ∪ Y) − s(X) · s(Y)
● Conviction:
○ Measures the degree of dependence between antecedent and consequent (a value of 1 indicates independence):
conviction(X → Y) = P(X) · P(¬Y) / P(X, ¬Y)
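A sketch of these three metrics over the running example, reusing support_count and transactions from the earlier snippets. Here lift is computed in its symmetric form s(X ∪ Y) / (s(X) · s(Y)), which equals c / s(Y); the helper names are my own:

```python
N = len(transactions)

def itemset_support(itemset):
    """Plain support s(X) of a single itemset."""
    return support_count(itemset, transactions) / N

def lift(X, Y):
    return itemset_support(X | Y) / (itemset_support(X) * itemset_support(Y))

def leverage(X, Y):
    return itemset_support(X | Y) - itemset_support(X) * itemset_support(Y)

def conviction(X, Y):
    c = itemset_support(X | Y) / itemset_support(X)   # confidence of X -> Y
    return (1 - itemset_support(Y)) / (1 - c) if c < 1 else float("inf")

X, Y = {"Milk", "Diapers"}, {"Beer"}
print(f"lift = {lift(X, Y):.2f}")              # 0.4 / (0.6 * 0.6) ≈ 1.11
print(f"leverage = {leverage(X, Y):.2f}")      # 0.4 - 0.36 = 0.04
print(f"conviction = {conviction(X, Y):.2f}")  # (1 - 0.6) / (1 - 0.67) ≈ 1.20
```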
Objective metrics for the evaluation of association rules
● There are many metrics proposed in the literature
● Some of them may be good for certain applications, but not for others
● There is no clear definition of their usefulness...
On transforming nominal and categorical attributes
● Many datasets contain attributes of different types in a list of items
● It is necessary to convert them to a suitable format before they can be explored by associative analysis methods
On transforming nominal and categorical attributes
● Sex: symmetric binary attribute → two binary attributes (M, F)
● Education: categorical attribute → three binary attributes (PG, S, 2G)
● Problem: if the values of a nominal attribute are individually infrequent (e.g., State), they may not generate frequent items → grouping (a conversion sketch follows below)
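As an illustrative sketch of this nominal-to-binary conversion (the records and values are hypothetical), pandas' get_dummies produces one binary column per attribute value, the format expected by frequent-itemset miners:

```python
import pandas as pd

# Hypothetical records with nominal attributes, invented for illustration.
df = pd.DataFrame({
    "Sex": ["M", "F", "F", "M"],
    "Education": ["PG", "S", "2G", "S"],
})

# One binary column per attribute value.
binary = pd.get_dummies(df, columns=["Sex", "Education"])
print(binary.columns.tolist())
# ['Sex_F', 'Sex_M', 'Education_2G', 'Education_PG', 'Education_S']
```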
On transforming continuous attributes
● Discretization methods:
○ Equal-width intervals, equal data frequency, etc. (see the sketch below)
○ Problem: discretization can generate a large number of attributes
○ Adjusting the optimal ranges is computationally expensive
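Both discretization strategies mentioned above are available in pandas; a small sketch with made-up ages and bin labels:

```python
import pandas as pd

age = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Equal-width intervals: bins of identical span.
equal_width = pd.cut(age, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency intervals: roughly the same number of records per bin.
equal_freq = pd.qcut(age, q=3, labels=["low", "mid", "high"])

print(equal_width.tolist())
print(equal_freq.tolist())
```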
Case study #3
● Breast Cancer Wisconsin dataset
● Objective: differential diagnosis of breast cancer using characteristics of the cell nuclei present in the images
Case study #3
● Breast Cancer Wisconsin dataset
● Question: which antecedents lead to malignancy?
Associative analysis – advanced topics
● Association rules for infrequent/negatively correlated patterns
● Association rules for sequential (temporal) patterns
● Association rules for graphs

● In all cases, extensive preliminary manipulation of the dataset may be necessary to allow the use of conventional software

You might also like