RDataMining Slides: Association Rules
Yanchang Zhao
http://www.RDataMining.com
July 2019
∗ Chapter 9: Association Rules, in R and Data Mining: Examples and Case Studies. http://www.rdatamining.com/docs/RDataMining-book.pdf
1 / 68
Contents
Association Rules: Concept and Algorithms
Basics of Association Rules
Algorithms: Apriori, ECLAT and FP-growth
Interestingness Measures
Applications
Exercise
2 / 68
Association Rules
I To discover association rules showing itemsets that occur
together frequently [Agrawal et al., 1993].
I Widely used to analyze retail basket or transaction data.
I An association rule is of the form A ⇒ B, where A and B are
itemsets or attribute-value pair sets and A ∩ B = ∅.
I A: antecedent, left-hand-side or LHS
I B: consequent, right-hand-side or RHS
I The rule means that database tuples containing the items on
the left-hand side of the rule are also likely to contain the
items on the right-hand side.
I Examples of association rules:
I bread ⇒ butter
I computer ⇒ software
I age in [25,35] & income in [80K,120K] ⇒ buying up-to-date
mobile handsets
3 / 68
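Rules are ranked with three standard measures: support(A ⇒ B) = P(A ∪ B), confidence = support(A ∪ B) / support(A), and lift = confidence / support(B). A base-R toy computation of these measures (an illustration added here, not the slides' code):

```r
## toy transaction database: each element is one shopping basket
transactions <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("butter"),
  c("bread", "butter", "jam")
)
n <- length(transactions)

## support of an itemset = fraction of transactions containing all items
supp <- function(items) {
  mean(sapply(transactions, function(t) all(items %in% t)))
}

## measures for the rule {bread} => {butter}
supp.rule <- supp(c("bread", "butter"))  # P(A and B) = 3/5 = 0.6
conf.rule <- supp.rule / supp("bread")   # 0.6 / 0.8 = 0.75
lift.rule <- conf.rule / supp("butter")  # 0.75 / 0.8 = 0.9375
```

A lift below 1, as here, indicates that bread buyers are actually slightly less likely than average to buy butter.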
Association Rules
4 / 68
An Example
5 / 68
Association Rule Mining
6 / 68
Downward-Closure Property
7 / 68
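The downward-closure (Apriori) property says that support is anti-monotone: every subset of a frequent itemset is itself frequent, which is what lets Apriori prune the search space. A base-R toy check (an added illustration, not the slides' code):

```r
## toy transactions
transactions <- list(
  c("a", "b", "c"),
  c("a", "b"),
  c("a", "c"),
  c("b", "c"),
  c("a", "b", "c")
)

## support = fraction of transactions containing all given items
supp <- function(items) {
  mean(sapply(transactions, function(t) all(items %in% t)))
}

s.abc <- supp(c("a", "b", "c"))  # 2/5 = 0.4
## every proper subset must have support >= support of the superset
subsets <- list(c("a", "b"), c("a", "c"), c("b", "c"), "a", "b", "c")
all(sapply(subsets, supp) >= s.abc)  # TRUE
```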
Itemset Lattice
Frequent
Infrequent
8 / 68
Apriori
9 / 68
Apriori Process
10 / 68
From [?] 11 / 68
FP-growth
† https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm
12 / 68
FP-tree
I The frequent-pattern tree (FP-tree) is a compact structure
that stores quantitative information about frequent patterns in
a dataset. It has two components:
I A root labeled as “null” with a set of item-prefix subtrees as
children
I A frequent-item header table
I Each node has three attributes:
I Item name
I Count: number of transactions represented by the path from
root to the node
I Node link: links to the next node having the same item name
I Each entry in the frequent-item header table also has three
attributes:
I Item name
I Head of node link: points to the first node in the FP-tree
having the same item name
I Count: frequency of the item
13 / 68
FP-tree
14 / 68
The FP-growth Algorithm
15 / 68
The FP-growth Algorithm
I Recursive processing of this compressed version of the main
dataset grows frequent itemsets directly, instead of generating
candidate itemsets and testing them against the entire database.
I Growth starts from the bottom of the header table (the items
with the longest branches), by finding all instances matching
a given condition.
I A new tree is created, with counts projected from the original
tree corresponding to the set of instances that are conditional
on the attribute, and with each node getting the sum of its
children's counts.
I Recursive growth ends when no individual items conditional
on the attribute meet the minimum support threshold;
processing then continues on the remaining header items of
the original FP-tree.
I Once the recursive process has completed, all frequent itemsets
meeting the minimum support have been found, and association
rule creation begins.
16 / 68
ECLAT
17 / 68
ECLAT
I ECLAT works recursively.
I The initial call uses all single items together with their tidsets.
I In each recursive call, it combines each itemset-tidset pair
(X , t(X )) with all the other pairs to generate new candidates.
If a new candidate is frequent, it is added to the set PX .
I It then recursively finds all frequent itemsets in the X branch.
18 / 68
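The tidset idea at ECLAT's core can be sketched in a few lines of base R (a toy illustration added here, not the slides' code):

```r
## ECLAT represents each item by its tidset: the IDs of the transactions
## containing it. The tidset of a candidate itemset is the intersection
## of its items' tidsets, so support is computed without rescanning the
## database. Toy illustration with 5 transactions:
tidsets <- list(
  bread  = c(1, 2, 3, 5),
  butter = c(1, 2, 4, 5),
  milk   = c(1, 3)
)
n.trans <- 5

## candidate {bread, butter}: intersect the two tidsets
t.bread.butter <- intersect(tidsets$bread, tidsets$butter)  # 1 2 5
supp.bread.butter <- length(t.bread.butter) / n.trans       # 0.6
```

In R, the arules package also provides an `eclat()` function that mines frequent itemsets this way, from which rules can be induced with `ruleInduction()`.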
ECLAT
From [?]
19 / 68
Interestingness Measures
20 / 68
Objective Interestingness Measures
I Support, confidence and lift are the most widely used
objective measures for selecting interesting rules.
I Many other objective measures were introduced by Tan et al.
[Tan et al., 2002], such as φ-coefficient, odds ratio, kappa,
mutual information, J-measure, Gini index, Laplace,
conviction, interest and cosine.
I Different measures have different intrinsic properties, and no
single measure is better than the others in all application
domains.
I In addition, any-confidence, all-confidence and bond are
proposed by Omiecinski [Omiecinski, 2003].
I Utility is used by Chan et al. [Chan et al., 2003] to find top-k
objective-directed rules.
I Unexpected Confidence Interestingness and Isolated
Interestingness are designed by Dong and Li
[Dong and Li, 1998], considering a rule's unexpectedness in
terms of the other association rules in its neighbourhood.
21 / 68
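Many of these measures are available in R via `arules::interestMeasure()`. A sketch, assuming a mined rule set `rules` and the transaction data `trans` it was mined from (the measure names below are those used in recent arules versions):

```r
library(arules)

## compute additional objective measures for an existing rule set;
## `rules` and `trans` are assumed to exist already
extra <- interestMeasure(rules,
                         measure = c("oddsRatio", "phi", "gini",
                                     "conviction", "cosine"),
                         transactions = trans)

## attach them to the rules' quality slot so inspect() and sort() see them
quality(rules) <- cbind(quality(rules), extra)
```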
Subjective Interestingness Measures
22 / 68
Interestingness Measures - I
23 / 68
Interestingness Measures - II
24 / 68
Applications
I Market basket analysis
I Identifying associations between items in shopping baskets,
i.e., which items are frequently purchased together
I Can be used by retailers to understand customer shopping
habits, do selective marketing and plan shelf space
I Churn analysis and selective marketing
I Discovering demographic characteristics and behaviours of
customers who are likely/unlikely to switch to other telcos
I Identifying customer groups who are likely to purchase a new
service or product
I Credit card risk analysis
I Finding characteristics of customers who are likely to default
on credit card or mortgage
I Can be used by banks to reduce risks when assessing new
credit card or mortgage applications
25 / 68
Applications (cont.)
26 / 68
Contents
Association Rules: Concept and Algorithms
Basics of Association Rules
Algorithms: Apriori, ECLAT and FP-growth
Interestingness Measures
Applications
Exercise
27 / 68
Association Rule Mining Algorithms in R
28 / 68
The Titanic Dataset
29 / 68
Pipe Operations in R
30 / 68
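The code on the following slides uses the `%>%` pipe from the magrittr package, which passes the value on its left into the first argument of the call on its right. A minimal sketch of the equivalence:

```r
library(magrittr)  # provides %>%

x <- c(1, 4, 9, 16)

## nested call, read inside out
sum(sqrt(x))           # 10

## piped call, read left to right; same result
x %>% sqrt() %>% sum() # 10
```

Pipes mainly improve readability when chaining several data-processing steps, as on the slides below.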
## download data
download.file(url="http://www.rdatamining.com/data/titanic.raw.rdata",
destfile="./data/titanic.raw.rdata")
## structure of data
titanic.raw %>% str()
## 'data.frame': 2201 obs. of 4 variables:
## $ Class : Factor w/ 4 levels "1st","2nd","3rd",..: 3 3 3...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 ...
## $ Age : Factor w/ 2 levels "Adult","Child": 2 2 2 2 2 ...
## $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1...
31 / 68
## draw a random sample of 5 records
idx <- 1:nrow(titanic.raw) %>% sample(5)
titanic.raw[idx, ]
## Class Sex Age Survived
## 2080 2nd Female Adult Yes
## 1162 Crew Male Adult No
## 954 Crew Male Adult No
## 2172 3rd Female Adult Yes
## 456 3rd Male Adult No
32 / 68
Function apriori()
33 / 68
## mine association rules
library(arules) ## load required library
rules.all <- titanic.raw %>% apriori() ## run the APRIORI algorithm
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime
## 0.8 0.1 1 none FALSE TRUE 5
## support minlen maxlen target ext
## 0.1 1 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 220
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[10 item(s), 2201 transaction(s)] done ...
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [27 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
34 / 68
rules.all %>% length() ## number of rules discovered
## [1] 27
## run APRIORI again to find rules with rhs containing "Survived" only
rules.surv <- titanic.raw %>% apriori(
control = list(verbose=F),
parameter = list(minlen=2, supp=0.005, conf=0.8),
appearance = list(rhs=c("Survived=No",
"Survived=Yes"),
default="lhs"))
## keep three decimal places
quality(rules.surv) <- rules.surv %>% quality() %>% round(digits=3)
## sort rules by lift
rules.surv.sorted <- rules.surv %>% sort(by="lift")
36 / 68
rules.surv.sorted %>% inspect() ## print rules
## lhs rhs support confidence lif...
## [1] {Class=2nd, ...
## Age=Child} => {Survived=Yes} 0.011 1.000 3.09...
## [2] {Class=2nd, ...
## Sex=Female, ...
## Age=Child} => {Survived=Yes} 0.006 1.000 3.09...
## [3] {Class=1st, ...
## Sex=Female} => {Survived=Yes} 0.064 0.972 3.01...
## [4] {Class=1st, ...
## Sex=Female, ...
## Age=Adult} => {Survived=Yes} 0.064 0.972 3.01...
## [5] {Class=2nd, ...
## Sex=Female} => {Survived=Yes} 0.042 0.877 2.71...
## [6] {Class=Crew, ...
## Sex=Female} => {Survived=Yes} 0.009 0.870 2.69...
## [7] {Class=Crew, ...
## Sex=Female, ...
## Age=Adult} => {Survived=Yes} 0.009 0.870 2.69...
## [8] {Class=2nd, ...
## Sex=Female, ...
## Age=Adult} => {Survived=Yes} 0.036 0.860 2.66...
## [9] {Class=2nd, ...
37 / 68
Redundant Rules
38 / 68
Redundant Rules
## redundant rules
rules.surv.sorted[1:2] %>% inspect()
## lhs rhs support confidence lift...
## [1] {Class=2nd, ...
## Age=Child} => {Survived=Yes} 0.011 1 3.096...
## [2] {Class=2nd, ...
## Sex=Female, ...
## Age=Child} => {Survived=Yes} 0.006 1 3.096...
40 / 68
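The next slide inspects `rules.surv.pruned`, but the pruning step itself is missing from this extracted deck. A sketch of one standard way to build it with `arules::is.redundant()` (the original slides may have used a subset-matrix approach via `is.subset()` instead):

```r
## a rule is redundant if a more general rule (same RHS, with an LHS
## that is a subset) has equal or higher confidence
redundant <- rules.surv.sorted %>% is.redundant()

## keep only the non-redundant rules
rules.surv.pruned <- rules.surv.sorted[!redundant]
```

For example, rule [2] above ({Class=2nd, Sex=Female, Age=Child}) adds nothing over rule [1] ({Class=2nd, Age=Child}), which has the same confidence with a more general LHS, so it is removed.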
Remaining Rules
rules.surv.pruned %>% inspect() ## print rules
## lhs rhs support confidence lift...
## [1] {Class=2nd, ...
## Age=Child} => {Survived=Yes} 0.011 1.000 3.096...
## [2] {Class=1st, ...
## Sex=Female} => {Survived=Yes} 0.064 0.972 3.010...
## [3] {Class=2nd, ...
## Sex=Female} => {Survived=Yes} 0.042 0.877 2.716...
## [4] {Class=Crew, ...
## Sex=Female} => {Survived=Yes} 0.009 0.870 2.692...
## [5] {Class=2nd, ...
## Sex=Male, ...
## Age=Adult} => {Survived=No} 0.070 0.917 1.354...
## [6] {Class=2nd, ...
## Sex=Male} => {Survived=No} 0.070 0.860 1.271...
## [7] {Class=3rd, ...
## Sex=Male, ...
## Age=Adult} => {Survived=No} 0.176 0.838 1.237...
## [8] {Class=3rd, ...
## Sex=Male} => {Survived=No} 0.192 0.827 1.222...
41 / 68
rules.surv.pruned[1] %>% inspect() ## print rules
## lhs rhs support confidence
## [1] {Class=2nd,Age=Child} => {Survived=Yes} 0.011 1
## lift count
## [1] 3.096 24
42 / 68
Find Rules about Age Groups
I Use lower thresholds to find all rules for children of different
classes
I verbose=F: suppress progress report
I minlen=3: find rules that contain at least three items
I Use lower thresholds for support and confidence
I lhs=c(...), rhs=c(...): find rules whose left/right-hand
sides are in the given lists
I quality(...): interestingness measures
43 / 68
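A sketch of the `apriori()` call the bullet points describe, producing the `rules.age` inspected on the next slide (the threshold values and item lists are assumptions, chosen to be consistent with the output shown there):

```r
## find rules with Class and Age on the LHS and survival on the RHS;
## low support/confidence thresholds keep rules for small groups
## (e.g. first-class children, with only 6 matching records)
rules.age <- titanic.raw %>% apriori(
  control = list(verbose = F),
  parameter = list(minlen = 3, supp = 0.002, conf = 0.2),
  appearance = list(rhs = c("Survived=Yes"),
                    lhs = c("Class=1st", "Class=2nd", "Class=3rd",
                            "Age=Child", "Age=Adult"),
                    default = "none"))

## order rules by confidence for inspection
rules.age <- rules.age %>% sort(by = "confidence")
```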
Rules about Age Groups
rules.age %>% inspect() ## print rules
## lhs rhs support
## [1] {Class=2nd,Age=Child} => {Survived=Yes} 0.010904134
## [2] {Class=1st,Age=Child} => {Survived=Yes} 0.002726034
## [3] {Class=1st,Age=Adult} => {Survived=Yes} 0.089504771
## [4] {Class=2nd,Age=Adult} => {Survived=Yes} 0.042707860
## [5] {Class=3rd,Age=Child} => {Survived=Yes} 0.012267151
## [6] {Class=3rd,Age=Adult} => {Survived=Yes} 0.068605179
## confidence lift count
## [1] 1.0000000 3.0956399 24
## [2] 1.0000000 3.0956399 6
## [3] 0.6175549 1.9117275 197
## [4] 0.3601533 1.1149048 94
## [5] 0.3417722 1.0580035 27
## [6] 0.2408293 0.7455209 151
[Scatter plot of the rules]
I X-axis: support
I Y-axis: confidence
I Colour: lift
45 / 68
Items in LHS Group
[Grouped matrix plot of the rules; e.g., one group is labelled "2 rules: {Age=Child, Class=2nd, +1 items}"]
46 / 68
rules.surv %>% plot(method="graph",
control=list(layout=igraph::with_fr()))
[Graph plot of the rules: item nodes (Class, Sex, Age and Survived values) linked via the rules, using the Fruchterman-Reingold layout]
47 / 68
rules.surv %>% plot(method="graph",
control=list(layout=igraph::in_circle()))
[Graph plot of the rules with the item nodes laid out in a circle]
48 / 68
rules.surv %>% plot(method="paracoord",
control=list(reorder=T))
[Parallel coordinates plot of the rules; x-axis: position (3, 2, 1, rhs), y-axis: items]
49 / 68
Interactive Plots and Reordering Rules
interactive = TRUE
I Select and inspect one or more rules
I Zoom
I Filter rules with an interestingness measure
reorder = TRUE
I Improves the visualisation by reordering rules to minimise
crossovers
I The visualisation is likely to change from run to run.
50 / 68
Wrap Up
51 / 68
Contents
Association Rules: Concept and Algorithms
Basics of Association Rules
Algorithms: Apriori, ECLAT and FP-growth
Interestingness Measures
Applications
Exercise
52 / 68
Further Readings
I Association Rule Learning
https://en.wikipedia.org/wiki/Association_rule_learning
I Data Mining Algorithms In R: Apriori
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_Apriori_Algorithm
I Data Mining Algorithms In R: ECLAT
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_Eclat_Algorithm
I Data Mining Algorithms In R: FP-Growth
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm
I FP-Growth Implementation by Christian Borgelt
http://www.borgelt.net/fpgrowth.html
I Frequent Itemset Mining Implementations Repository
http://fimi.ua.ac.be/data/
53 / 68
Further Readings
I More than 20 interestingness measures, such as chi-square,
conviction, gini and leverage
Tan, P.-N., Kumar, V., and Srivastava, J. (2002). Selecting the right
interestingness measure for association patterns. In Proc. of KDD ’02,
pages 32-41, New York, NY, USA. ACM Press.
I More reviews on interestingness measures:
[Silberschatz and Tuzhilin, 1996], [Tan et al., 2002] and
[Omiecinski, 2003]
I Post mining of association rules, such as selecting interesting
association rules, visualization of association rules and using
association rules for classification [Zhao et al., 2009]
Yanchang Zhao, et al. (Eds.). “Post-Mining of Association Rules:
Techniques for Effective Knowledge Extraction”, ISBN
978-1-60566-404-0, May 2009. Information Science Reference.
I Package arulesSequences: mining sequential patterns
http://cran.r-project.org/web/packages/arulesSequences/
54 / 68
Contents
Association Rules: Concept and Algorithms
Basics of Association Rules
Algorithms: Apriori, ECLAT and FP-growth
Interestingness Measures
Applications
Exercise
55 / 68
The Mushroom Dataset I
I The mushroom dataset includes descriptions of hypothetical
samples corresponding to 23 species of gilled mushrooms ‡ .
I A csv file with 8,124 observations on 23 categorical variables:
1. class: edible=e, poisonous=p
2. cap-shape: bell=b,conical=c,convex=x,flat=f,
knobbed=k,sunken=s
3. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
4. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,
pink=p,purple=u,red=e,white=w,yellow=y
5. bruises?: bruises=t,no=f
6. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,
musty=m,none=n,pungent=p,spicy=s
7. gill-attachment: attached=a,descending=d,free=f,notched=n
8. gill-spacing: close=c,crowded=w,distant=d
9. gill-size: broad=b,narrow=n
10. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,
green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y
56 / 68
The Mushroom Dataset II
57 / 68
The Mushroom Dataset III
21. spore-print-color:
black=k,brown=n,buff=b,chocolate=h,green=r,
orange=o,purple=u,white=w,yellow=y
22. population: abundant=a,clustered=c,numerous=n,
scattered=s,several=v,solitary=y
23. habitat: grasses=g,leaves=l,meadows=m,paths=p,
urban=u,waste=w,woods=d
‡
https://archive.ics.uci.edu/ml/datasets/Mushroom
58 / 68
Load Mushroom Dataset
59 / 68
The Mushroom Dataset
str(mushrooms)
## 'data.frame': 8124 obs. of 23 variables:
## $ class : Factor w/ 2 levels "e","p": 2 ...
## $ cap-shape : Factor w/ 6 levels "b","c","f"...
## $ cap-surface : Factor w/ 4 levels "f","g","s"...
## $ cap-color : Factor w/ 10 levels "b","c","e...
## $ bruises : Factor w/ 2 levels "f","t": 2 ...
## $ odor : Factor w/ 9 levels "a","c","f"...
## $ gill-attachment : Factor w/ 2 levels "a","f": 2 ...
## $ gill-spacing : Factor w/ 2 levels "c","w": 1 ...
## $ gill-size : Factor w/ 2 levels "b","n": 2 ...
## $ gill-color : Factor w/ 12 levels "b","e","g...
## $ stalk-shape : Factor w/ 2 levels "e","t": 1 ...
## $ stalk-root : Factor w/ 5 levels "?","b","c"...
## $ stalk-surface-above-ring: Factor w/ 4 levels "f","k","s"...
## $ stalk-surface-below-ring: Factor w/ 4 levels "f","k","s"...
## $ stalk-color-above-ring : Factor w/ 9 levels "b","c","e"...
## $ stalk-color-below-ring : Factor w/ 9 levels "b","c","e"...
## $ veil-type : Factor w/ 1 level "p": 1 1 1 1...
## $ veil-color : Factor w/ 4 levels "n","o","w"...
## $ ring-number : Factor w/ 3 levels "n","o","t"...
60 / 68
Exercise
61 / 68
Mining Association Rules from Mushroom Dataset
## find association rules from the mushroom dataset
rules <- apriori(mushrooms, control = list(verbose=F),
parameter = list(minlen=2, maxlen=5),
appearance = list(rhs=c("class=p", "class=e"),
default="lhs"))
quality(rules) <- round(quality(rules), digits=3)
rules.sorted <- sort(rules, by="confidence")
inspect(head(rules.sorted))
## lhs rhs support confidence
## [1] {ring-type=l} => {class=p} 0.160 1
## [2] {gill-color=b} => {class=p} 0.213 1
## [3] {odor=f} => {class=p} 0.266 1
## [4] {gill-size=b,gill-color=n} => {class=e} 0.108 1
## [5] {odor=n,stalk-root=e} => {class=e} 0.106 1
## [6] {bruises=f,stalk-root=e} => {class=e} 0.106 1
## lift count
## [1] 2.075 1296
## [2] 2.075 1728
## [3] 2.075 2160
## [4] 1.931 880
## [5] 1.931 864
62 / 68
Online Resources
63 / 68
The End
Thanks!
Email: yanchang(at)RDataMining.com
Twitter: @RDataMining
64 / 68
How to Cite This Work
I Citation
Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN
978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256
pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf.
I BibTex
@BOOK{Zhao2012R,
title = {R and Data Mining: Examples and Case Studies},
publisher = {Academic Press, Elsevier},
year = {2012},
author = {Yanchang Zhao},
pages = {256},
month = {December},
isbn = {978-0-123-96963-7},
keywords = {R, data mining},
url = {http://www.rdatamining.com/docs/RDataMining-book.pdf}
}
65 / 68
References I
Agrawal, R., Imielinski, T., and Swami, A. (1993).
Mining association rules between sets of items in large databases.
In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 207–216,
Washington D.C. USA.
Freitas, A. A. (1998).
On objective measures of rule surprisingness.
In PKDD ’98: Proceedings of the Second European Symposium on Principles of Data Mining and
Knowledge Discovery, pages 1–9, London, UK. Springer-Verlag.
Han, J. (2005).
Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.
66 / 68
References II
Liu, B. and Hsu, W. (1996).
Post-analysis of learned rules.
In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), pages 828–834,
Portland, Oregon, USA.
Omiecinski, E. R. (2003).
Alternative interest measures for mining associations in databases.
IEEE Transactions on Knowledge and Data Engineering, 15(1):57–69.
67 / 68
References III
Zhao, Y. (2012).
R and Data Mining: Examples and Case Studies, ISBN 978-0-12-396963-7.
Academic Press, Elsevier.
68 / 68