
Statistical Computing with R:
Masters in Data Sciences 503 (S28)
Third Batch, SMS, TU, 2024

Shital Bhandary
Associate Professor
Statistics/Bio-statistics, Demography and Public Health Informatics
Patan Academy of Health Sciences, Lalitpur, Nepal
Faculty, Data Analysis and Decision Modeling, MBA, Pokhara University, Nepal
Faculty, FAIMER Fellowship in Health Professions Education, India/USA.
Review & Preview: Unsupervised models
Review:
• Association rules learning
• Market-Basket analysis

Preview:
• Monte Carlo simulations
• Good old days!
• Class imbalance problem
  • Statistical approach
  • Data science approach
Association rules learning/mining:
https://towardsdatascience.com/association-rule-mining-in-r-ddf2d044ae50

• Association Rule Mining (also called Association Rule Learning) is a common technique used to find associations (co-occurrence) between many variables.
• It is often used by grocery stores, e-commerce websites, and anyone with large transactional databases.
• A most common example that we encounter in our daily lives: Amazon knows what else you want to buy when you order something on their site.
• The same idea extends to Spotify too: they know what song you want to listen to next.
• All of these incorporate, at some level, data mining concepts and association rule mining algorithms.
Association rules: example problem
https://www.datacamp.com/community/tutorials/market-basket-analysis-r

• You get a client who runs a retail store and gives you data for all transactions, consisting of items bought in the store by several customers over a period of time.
• Your client then asks you to use that data to help boost their business.
• Your client will use your findings not only to change/update/add items in inventory but also to change the layout of the physical store, or rather the online store.
• To find results that will help your client, you will use Market Basket Analysis (MBA), which applies Association Rule Mining to the given transaction data.
Use of association rules mining results:
https://www.datacamp.com/community/tutorials/market-basket-analysis-r

• Changing the store layout according to trends
• Customer behavior analysis
• Catalogue design
• Cross marketing on online stores
• Identifying the trending items customers buy
• Customized emails with add-on sales
• etc.
Association rule mining: If => Then analysis
https://www.datacamp.com/community/tutorials/market-basket-analysis-r

• Association Rule Mining is used when you want to find associations between different objects in a set, or to find frequent patterns in a transaction database, relational databases, or any other information repository.
• Its applications are found in marketing, basket data analysis (or Market Basket Analysis) in retailing, clustering, and classification.
• It can tell you which items customers frequently buy together by generating a set of rules called Association Rules.
• In simple words, it gives you output as rules of the form "if this, then that".
What is the apriori algorithm and what is a rule?
http://r-statistics.co/Association-Mining-With-R.html

• Association mining is usually done on transactions data from a retail market or from an online e-commerce store.
• Since most transactions data is large, the apriori algorithm makes it easier to find these patterns or rules quickly.
• A rule is a notation that represents which item(s) are frequently bought with which other item(s).
• It has an LHS and an RHS part and can be represented as follows:

  itemsetA => itemsetB

• This means the item(s) on the right were frequently purchased along with the item(s) on the left.
How to measure the strength of a rule?
http://r-statistics.co/Association-Mining-With-R.html

• The apriori algorithm generates the most relevant set of rules from given transaction data.
• It also reports the support, confidence and lift of those rules.
• These three measures can be used to decide the relative strength of the rules.
• How are they computed? Consider the rule A => B:

Support = (number of transactions with both A and B) / (total number of transactions)
        = P(A ∩ B) = frequency(A, B) / N

Confidence = (number of transactions with both A and B) / (number of transactions with A)
           = P(A ∩ B) / P(A) = frequency(A, B) / frequency(A)

Expected Confidence = (number of transactions with B) / (total number of transactions)
                    = P(B) = frequency(B) / N

Lift = Confidence / Expected Confidence
     = P(A ∩ B) / (P(A) · P(B)) = Support(A, B) / (Support(A) · Support(B))
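To make these formulas concrete, here is a minimal base-R sketch; the helper names (contains, support, confidence, lift) are illustrative, not from any package, and the baskets are the five example transactions used later in these slides:

# Five example baskets (T1 to T5, as on the later slides)
baskets <- list(
  c("bread", "milk"),
  c("bread", "diapers", "beer", "Eggs"),
  c("milk", "diapers", "beer", "cola"),
  c("bread", "milk", "diapers", "beer"),
  c("bread", "milk", "diapers", "cola")
)

# TRUE if a basket contains every item in `items`
contains <- function(basket, items) all(items %in% basket)

# Support = frequency(items) / N
support <- function(items) mean(sapply(baskets, contains, items = items))

# Confidence = P(A and B) / P(A)
confidence <- function(A, B) support(c(A, B)) / support(A)

# Lift = P(A and B) / (P(A) * P(B)) = Confidence / Expected Confidence
lift <- function(A, B) support(c(A, B)) / (support(A) * support(B))

support(c("diapers", "beer"))   # 3/5 = 0.6
confidence("diapers", "beer")   # 0.6 / 0.8 = 0.75
lift("diapers", "beer")         # 0.6 / (0.8 * 0.6) = 1.25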
Association rule: Support and confidence
• Association rules are given in the form below:

  A => B [Support, Confidence]

• The part before => is referred to as the "if" part (antecedent) and the part after => as the "then" part (consequent).
• A and B are sets of items in the transaction data, and A and B are disjoint sets.

Example: Computer => Anti-virus Software [Support = 20%, Confidence = 60%]

The above rule says:
• 20% of transactions show anti-virus software bought together with a computer (support).
• 60% of customers who purchase a computer also buy anti-virus software (confidence).
Lift:
• Lift is the factor by which the co-occurrence of A and B exceeds the expected probability of A and B co-occurring, had they been independent.
• So, the higher the lift, the higher the chance of A and B occurring together.
• lift = 1: implies no association between the items.
• lift > 1: item B is likely to be bought if item A is bought.
• lift < 1: item B is unlikely to be bought if item A is bought.
Note:
• Frequent itemsets: itemsets whose support is greater than or equal to the minimum support threshold (min_sup).
• min_sup is set by the user's choice.
• Strong rules: if a rule A => B [Support, Confidence] satisfies min_sup and min_confidence, then it is a strong rule.
• Coverage: coverage (also called cover or LHS-support) is the support of the left-hand side of the rule, i.e., supp(X) for a rule X => Y. It is a measure of how often the rule can be applied.
Example:
https://www.datacamp.com/community/tutorials/market-basket-analysis-r

Calculate the following for {Bread => Milk}, using the five transactions T1 to T5 shown on the next slide:

• Support(Bread) = f(Bread)/N = 4/5 = 0.8
• Support(Milk) = f(Milk)/N = 4/5 = 0.8
• Support(Bread, Milk) = f(Bread, Milk)/N = 3/5 = 0.6
• Confidence(Bread => Milk) = Support(Bread, Milk)/Support(Bread) = 0.6/0.8 = 0.75
• Expected Confidence(Bread => Milk) = P(Milk) = 4/5 = 0.8
• Lift(Bread => Milk) = Confidence/Expected Confidence = 0.75/0.80 = 0.9375,
  or equivalently Support(Bread, Milk)/[Support(Bread) · Support(Milk)] = 0.6/0.64 = 0.9375
• Coverage(Bread => Milk) = support(lhs) = Support(Bread) = 0.8
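Reusing the helper sketch from the measures slide, these numbers can be checked directly:

support("bread")              # 4/5 = 0.8
support("milk")               # 4/5 = 0.8
support(c("bread", "milk"))   # 3/5 = 0.6
confidence("bread", "milk")   # 0.6 / 0.8 = 0.75
lift("bread", "milk")         # 0.6 / (0.8 * 0.8) = 0.9375
support("bread")              # Coverage = support(lhs) = 0.8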
Let’s do it in R!
# create a list of baskets
market_basket <- list(
  c("bread", "milk"),
  c("bread", "diapers", "beer", "Eggs"),
  c("milk", "diapers", "beer", "cola"),
  c("bread", "milk", "diapers", "beer"),
  c("bread", "milk", "diapers", "cola")
)

# set transaction names (T1 to T5)
names(market_basket) <- paste("T", c(1:5), sep = "")
Let’s use the “arules” package and get some outputs:

library(arules)

# Transformation to transactions data
trans <- as(market_basket, "transactions")

# Dimensions
dim(trans)
[1] 5 6   # 5 transactions, 6 items

# Item labels
itemLabels(trans)
[1] "beer"    "bread"   "cola"    "diapers" "Eggs"    "milk"

# Summary
summary(trans)

# Plot
image(trans)
Let’s use the “arules” package and get some outputs:

summary(trans)

transactions as itemMatrix in sparse format with
 5 rows (elements/itemsets/transactions) and
 6 columns (items) and a density of 0.6 (non-zero cells)

most frequent items:
  bread diapers    milk    beer    cola (Other)
      4       4       4       3       2       1

element (itemset/transaction) length distribution:
sizes
 2  4   (itemset sizes)
 1  4   (number of transactions of each size)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    2.0     4.0     4.0     3.6     4.0     4.0
Let’s inspect “trans”:

inspect(trans)
    items                        transactionID
[1] {bread, milk}                T1
[2] {beer, bread, diapers, Eggs} T2
[3] {beer, cola, diapers, milk}  T3
[4] {beer, bread, diapers, milk} T4
[5] {bread, cola, diapers, milk} T5
Plot of “trans”:

#Plot
image(trans)
Apriori algorithm: why?
• Frequent itemset generation is the most computationally expensive step because it requires a full database scan.
• In the example above we had only 5 transactions, but real-world retail transaction data can run to GBs and TBs, for which an optimized algorithm is needed to prune out itemsets that will not help in later steps.
• For this, the APRIORI algorithm is used to create new rules.
• Since support and confidence measure how interesting a rule is, we use them to create the rules.
• New rules are constrained by the minimum support and minimum confidence thresholds.
• The closer a rule is to these thresholds, the more the rule is of use to the client.
• These thresholds, set by the client, help to compare rule strength according to your own or the client's needs.
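As an aside, apriori() can also return the frequent itemsets themselves instead of rules; a small sketch, assuming the `trans` object built earlier:

library(arules)

# Mine frequent itemsets (instead of rules) at minimum support 0.3
itemsets <- apriori(trans,
                    parameter = list(supp = 0.3,
                                     target = "frequent itemsets"))

# Show the most frequent itemsets first
inspect(head(sort(itemsets, by = "support"), 5))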
Apriori algorithm on “trans” with minimum support of 0.3 and minimum confidence of 0.5:

# Min support 0.3, min confidence 0.5
rules <- apriori(trans,
                 parameter = list(supp = 0.3, conf = 0.5,
                                  maxlen = 10,
                                  target = "rules"))

Apriori

Parameter specification:
 confidence minval smax arem aval originalSupport maxtime support minlen maxlen target ext
        0.5    0.1    1 none FALSE           TRUE       5     0.3      1     10  rules TRUE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Note: maxlen is the maximum number of items in a rule (LHS plus RHS). We could have used maxlen = 4 here, since we know no transaction has more than 4 items, but this will not be known in real life!
Summary of the “rules”:
summary(rules) # mining info:
#summary of quality measures: • data ntransactions support confidence
support confidence coverage lift • trans 5 0.3 0.5
Min. :0.4000 Min. :0.5000 Min. :0.4000 Min. :0.8333
1st Qu.:0.4000 1st Qu.:0.6667 1st Qu.:0.6000 1st Qu.:0.8333
Median :0.4000 Median :0.7500 Median :0.6000 Median :1.0000
Mean :0.4938 Mean :0.7474 Mean :0.6813 Mean :1.0473
3rd Qu.:0.6000 3rd Qu.:0.8000 3rd Qu.:0.8000 3rd Qu.:1.2500
Max. :0.8000 Max. :1.0000 Max. :1.0000 Max. :1.6667
Inspection of the “rules” with minlen:
Inspect (rules) #Output from R:
lhs rhs support confidence coverage lift count
• [1] {} => {beer} 0.6 0.6000000 1.0 1.0000000 3
• [2] {} => {milk} 0.8 0.8000000 1.0 1.0000000 4
• [3] {} => {bread} 0.8 0.8000000 1.0 1.0000000 4
• [4] {} => {diapers} 0.8 0.8000000 1.0 1.0000000 4
• [5] {cola} => {milk} 0.4 1.0000000 0.4 1.2500000 2
• [6] {milk} => {cola} 0.4 0.5000000 0.8 1.2500000 2
• [7] {cola} => {diapers} 0.4 1.0000000 0.4 1.2500000 2
• [8] {diapers} => {cola} 0.4 0.5000000 0.8 1.2500000 2
• [9] {beer} => {milk} 0.4 0.6666667 0.6 0.8333333 2
• [10] {milk} => {beer} 0.4 0.5000000 0.8 0.8333333 2
• [11] {beer} => {bread} 0.4 0.6666667 0.6 0.8333333 2
• [12] {bread} => {beer} 0.4 0.5000000 0.8 0.8333333 2
• [13] {beer} => {diapers} 0.6 1.0000000 0.6 1.2500000 3
• [14] {diapers} => {beer} 0.6 0.7500000 0.8 1.2500000 3
• [15] {milk} => {bread} 0.6 0.7500000 0.8 0.9375000 3
• [16] {bread} => {milk} 0.6 0.7500000 0.8 0.9375000 3
• ….
• [32]
We can remove the “empty” rules
rules <- apriori(trans, • set of 28 rules
parameter = list(supp=0.3, • rule length distribution (lhs + rhs):
conf=0.5, sizes
maxlen=10, • 2 3
minlen=2, • 16 12
target= "rules")) •
lhs rhs support confidence coverage lift count
• [1] {cola} => {milk} 0.4 1.0000000 0.4 1.2500000 2
• [2] {milk} => {cola} 0.4 0.5000000 0.8 1.2500000 2
• [3] {cola} => {diapers} 0.4 1.0000000 0.4 1.2500000 2
• …
• [17] {cola, milk} => {diapers} 0.4 1.0000000 0.4 1.2500000 2
• [18] {cola, diapers} => {milk} 0.4 1.0000000 0.4 1.2500000 2
• [19] {diapers, milk} => {cola} 0.4 0.6666667 0.6 1.6666667 2
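Alternatively (a sketch using the arules accessors lhs() and size()), the empty-LHS rules could be dropped from the first rule set without re-mining:

# Keep only rules whose left-hand side contains at least one item
nonempty_rules <- rules[size(lhs(rules)) > 0]
inspect(head(nonempty_rules))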
Let’s set an RHS rule for the “trans” data:

# To analyze what items customers buy before buying {beer},
# we set rhs = "beer" and default = "lhs":
beer_rules_rhs <- apriori(trans,
                          parameter = list(supp = 0.3, conf = 0.5,
                                           maxlen = 10, minlen = 2),
                          appearance = list(default = "lhs",
                                            rhs = "beer"))

# Inspect
inspect(beer_rules_rhs)
    lhs                  rhs    support confidence coverage lift      count
[1] {bread}           => {beer} 0.4     0.5000000  0.8      0.8333333 2
[2] {milk}            => {beer} 0.4     0.5000000  0.8      0.8333333 2
[3] {diapers}         => {beer} 0.6     0.7500000  0.8      1.2500000 3
[4] {bread, diapers}  => {beer} 0.4     0.6666667  0.6      1.1111111 2
[5] {diapers, milk}   => {beer} 0.4     0.6666667  0.6      1.1111111 2
Let’s set an LHS rule for the “trans” data:

# To analyze what items customers buy together with {beer} once they buy it,
# we set lhs = "beer" and default = "rhs":
beer_rules_lhs <- apriori(trans,
                          parameter = list(supp = 0.3, conf = 0.5,
                                           maxlen = 10, minlen = 2),
                          appearance = list(lhs = "beer",
                                            default = "rhs"))

# Inspect the result:
inspect(beer_rules_lhs)
    lhs         rhs       support confidence coverage lift      count
[1] {beer}   => {bread}   0.4     0.6666667  0.6      0.8333333 2
[2] {beer}   => {milk}    0.4     0.6666667  0.6      0.8333333 2
[3] {beer}   => {diapers} 0.6     1.0000000  0.6      1.2500000 3
Product recommendation rule:

# Product recommendation rule: sort rules by confidence
rules_conf <- sort(rules, by = "confidence", decreasing = TRUE)

# Inspect the rules: show the support, confidence and lift for the top rules
inspect(head(rules_conf))
    lhs                 rhs       support confidence coverage lift count
[1] {cola}           => {milk}    0.4     1          0.4      1.25 2
[2] {cola}           => {diapers} 0.4     1          0.4      1.25 2
[3] {beer}           => {diapers} 0.6     1          0.6      1.25 3
[4] {cola, milk}     => {diapers} 0.4     1          0.4      1.25 2
[5] {cola, diapers}  => {milk}    0.4     1          0.4      1.25 2
[6] {beer, milk}     => {diapers} 0.4     1          0.4      1.25 2
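The same pattern works with any quality measure; for instance, a sketch sorting the same rules object by lift:

# Rank rules by lift instead of confidence
rules_lift <- sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_lift))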
Plotting rules with the “arulesViz” package:

library(arulesViz)
plot(rules)

plot(rules, measure = "confidence")

plot(rules, method = "two-key plot")
Interactive plot with the “plotly” engine:

#Interactive plot
plot(rules, engine = "plotly")
Graph-based visualization:

#Graph based visualization of the top 10 rules by confidence
subrules <- head(rules, n = 10, by = "confidence")
plot(subrules, method = "graph", engine = "htmlwidget")
Parallel coordinate plot for 10 rules:

#Parallel coordinate plot
plot(subrules, method = "paracoord")
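arulesViz also ships an interactive, shiny-based explorer that combines these views in one app (assuming the shiny package is installed):

# Interactive rule explorer (opens a shiny app in the browser)
ruleExplorer(rules)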
More here:
• Like the one we did before: https://www.kirenz.com/post/2020-05-14-r-association-rule-mining/
• Real-life example: https://www.youtube.com/watch?v=91CmrpD-4Fw
Questions/queries?

• Next class:
  • Monte Carlo simulations
  • Class imbalance problem
    • Statistical approach
    • Data science approach
Thank you!
@shitalbhandary
