W5 - Apriori


Data Science and Big Data Analytics

Chap 5: Advanced Analytical Theory and Methods: Association Rules
Chapter Sections

 5.1 Overview
 5.2 Apriori Algorithm
 5.3 Evaluation of Candidate Rules
 5.4 Example: Transactions in a Grocery Store
 5.5 Validation and Testing
 5.6 Diagnostics
5.1 Overview
 Association rules method
 Unsupervised learning method
 Descriptive (not predictive) method
 Used to find hidden relationships in data
 The relationships are represented as rules
 Questions association rules might answer
 Which products tend to be purchased together?
 What products do similar customers tend to buy?
5.1 Overview
 Example – general logic of association rules
5.1 Overview
 Rules have the form X -> Y
 When X is observed, Y is also observed
 Itemset
 Collection of items or entities
 k-itemset = {item 1, item 2,…,item k}
 Examples
 Items purchased in one transaction
 Set of hyperlinks clicked by a user in one session
5.1 Overview – Apriori Algorithm
 Apriori is the most fundamental algorithm for mining frequent itemsets
 Given an itemset L, the support of L is the percentage of transactions that contain L
 Support(L) = (number of transactions containing L) / (total number of transactions)
 Frequent itemset – items that appear together “often enough”
 Minimum support defines “often enough” (% of transactions)
 If an itemset is frequent, then any subset of it is also frequent
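As a minimal sketch of the support definition above (the transactions and item names here are made-up for illustration, not from the text), support is simply the fraction of transactions containing every item in the itemset:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

# Hypothetical transaction database
transactions = [
    {"milk", "eggs", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"milk", "bread"},
]

print(support({"milk"}, transactions))          # in 3 of 4 transactions -> 0.75
print(support({"milk", "eggs"}, transactions))  # in 2 of 4 transactions -> 0.5
```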
5.1 Overview – Apriori Algorithm
 If {B,C,D} frequent, then all subsets frequent
5.2 Apriori Algorithm
Frequent = meets the minimum support threshold
 Bottom-up, iterative algorithm
 Identify the frequent (minimum support) 1-itemsets
 Frequent 1-itemsets are paired into candidate 2-itemsets, the frequent 2-itemsets are identified, and so on
 Definitions for next slide
 D = transaction database
 d = minimum support threshold
 N = maximum length of itemset (optional parameter)
 Ck = set of candidate k-itemsets
 Lk = set of k-itemsets with minimum support
5.2 Apriori Algorithm
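The bottom-up iteration can be sketched in a few lines using the notation defined above (D, d, Ck, Lk). This is an illustrative implementation under those definitions, not the book's reference pseudocode:

```python
from itertools import combinations

def apriori(D, d):
    """Return all frequent itemsets in transaction database D
    with support >= minimum support threshold d (a fraction)."""
    n = len(D)

    def sup(c):
        # Support of candidate c: fraction of transactions containing it
        return sum(1 for t in D if c <= t) / n

    # L1: frequent 1-itemsets
    items = {item for t in D for item in t}
    Lk = {frozenset([i]) for i in items if sup(frozenset([i])) >= d}
    frequent = set(Lk)
    k = 2
    while Lk:
        # Ck: candidate k-itemsets built by joining frequent (k-1)-itemsets
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune any candidate with an infrequent (k-1)-subset (Apriori property)
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        Lk = {c for c in Ck if sup(c) >= d}
        frequent |= Lk
        k += 1
    return frequent

# Hypothetical transaction database
D = [frozenset(t) for t in
     [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
print(apriori(D, 0.6))  # singletons and pairs are frequent; {a,b,c} is not
```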
5.3 Evaluation of Candidate Rules
Confidence
 Frequent itemsets can form candidate rules
 Confidence measures the certainty of a rule
 Confidence(X -> Y) = Support(X ∪ Y) / Support(X)
 Minimum confidence – a predefined threshold a rule must meet
 Problem with confidence
 Given a rule X -> Y, confidence considers only the antecedent (X) and the co-occurrence of X and Y
 It cannot tell whether a rule reflects a true implication
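As a sketch of the ratio (the milk/eggs counts echo the worked examples later in this section):

```python
def confidence(support_xy, support_x):
    """Confidence(X -> Y) = Support(X ∪ Y) / Support(X)."""
    return support_xy / support_x

# In 1000 transactions: {milk, eggs} appears in 300, {milk} in 500
print(confidence(300 / 1000, 500 / 1000))  # -> 0.6
```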
5.3 Evaluation of Candidate Rules
Lift
 Lift measures how much more often X and Y occur together than expected if they were statistically independent
 Lift(X -> Y) = Support(X ∪ Y) / (Support(X) * Support(Y))
 Lift = 1 if X and Y are statistically independent
 Lift > 1 indicates the degree of usefulness of the rule
 Example – in 1000 transactions,
 If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Lift(milk -> eggs) = 0.3/(0.5*0.4) = 1.5
 If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Lift(milk -> bread) = 0.4/(0.5*0.4) = 2.0
5.3 Evaluation of Candidate Rules
Leverage
 Leverage measures the difference between the probability of X and Y appearing together and what would be expected under statistical independence
 Leverage(X -> Y) = Support(X ∪ Y) - Support(X) * Support(Y)
 Leverage = 0 if X and Y are statistically independent
 Leverage > 0 indicates the degree of usefulness of the rule
 Example – in 1000 transactions,
 If {milk, eggs} appears in 300, {milk} in 500, and {eggs} in 400, then Leverage(milk -> eggs) = 0.3 - 0.5*0.4 = 0.1
 If {milk, bread} appears in 400, {milk} in 500, and {bread} in 400, then Leverage(milk -> bread) = 0.4 - 0.5*0.4 = 0.2
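Likewise, the worked leverage values follow directly from the same counts (again a minimal sketch over the example's numbers):

```python
def leverage(support_xy, support_x, support_y):
    """Leverage(X -> Y) = Support(X ∪ Y) - Support(X) * Support(Y)."""
    return support_xy - support_x * support_y

n = 1000
print(leverage(300 / n, 500 / n, 400 / n))  # milk -> eggs: approx. 0.1
print(leverage(400 / n, 500 / n, 400 / n))  # milk -> bread: approx. 0.2
```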
5.4 Applications of Association Rules
 The term market basket analysis refers to a specific implementation of association rules
 For better merchandising – products to include in/exclude from inventory each month
 Placement of products among related products
 Association rules are also used for
 Recommender systems – Amazon, Netflix
 Clickstream analysis from web usage log files
 Website visitors to page X click on links A, B, C more than on links D, E, F
5.5 Validation and Testing
 The frequent and high-confidence itemsets are found using pre-specified minimum support and minimum confidence levels
 Measures like lift and/or leverage then help ensure that interesting rules are identified rather than coincidental ones
 However, some of the remaining rules may be considered subjectively uninteresting because they don’t yield unexpected profitable actions
 E.g., rules like {paper} -> {pencil} are not interesting/meaningful
 Incorporating subjective knowledge requires domain experts
 Good rules provide valuable insights for institutions to
improve their business operations
5.6 Diagnostics
 Although minimum support is pre-specified in phases 3 & 4, this level can be adjusted to target a desired range for the number of rules generated – variants/improvements of Apriori are available
 For large datasets the Apriori algorithm can be computationally expensive – efficiency improvements include
 Partitioning
 Sampling
 Transaction reduction
 Hash-based itemset counting
 Dynamic itemset counting
arules in R
 https://rpubs.com/emzak208/281776
 https://rpubs.com/aru0511/GroceriesDatasetAssociationAnalysis
