
Association Rule Mining with R

Yanchang Zhao
http://www.RDataMining.com

R and Data Mining Course


Beijing University of Posts and Telecommunications,
Beijing, China

July 2019


Chapter 9 - Association Rules, in R and Data Mining: Examples and Case
Studies. http://www.rdatamining.com/docs/RDataMining-book.pdf
1 / 68
Contents
Association Rules: Concept and Algorithms
Basics of Association Rules
Algorithms: Apriori, ECLAT and FP-growth
Interestingness Measures
Applications

Association Rule Mining with R


Mining Association Rules
Removing Redundancy
Interpreting Rules
Visualizing Association Rules
Wrap Up

Further Readings and Online Resources

Exercise

2 / 68
Association Rules
I To discover association rules showing itemsets that occur
together frequently [Agrawal et al., 1993].
I Widely used to analyze retail basket or transaction data.
I An association rule is of the form A ⇒ B, where A and B are
itemsets or attribute-value pair sets and A ∩ B = ∅.
I A: antecedent, left-hand-side or LHS
I B: consequent, right-hand-side or RHS
I The rule means that database tuples containing the items on the
left-hand side of the rule are also likely to contain the items
on the right-hand side.
I Examples of association rules:
I bread ⇒ butter
I computer ⇒ software
I age in [25,35] & income in [80K,120K] ⇒ buying up-to-date
mobile handsets

3 / 68
Association Rules

Association rules are rules presenting association or correlation
between itemsets.

support(A ⇒ B) = support(A ∪ B) = P(A ∧ B)

confidence(A ⇒ B) = P(B|A) = P(A ∧ B) / P(A)

lift(A ⇒ B) = confidence(A ⇒ B) / P(B) = P(A ∧ B) / (P(A) P(B))

where P(A) is the percentage (or probability) of cases containing A.

4 / 68
An Example

I Assume there are 100 students.


I 10 out of them know data mining techniques, 8 know R
language and 6 know both of them.
I R ⇒ DM: If a student knows R, then he or she knows data
mining.
I support = P(R ∧ DM) = 6/100 = 0.06
I confidence = support / P(R) = 0.06/0.08 = 0.75
I lift = confidence / P(DM) = 0.75/0.1 = 7.5

5 / 68
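The numbers above can be verified with a few lines of R (an illustrative snippet, not part of the original example):

## counts from the example: 100 students, 10 know data mining (DM),
## 8 know R, 6 know both
n <- 100; n.dm <- 10; n.r <- 8; n.both <- 6

support <- n.both / n                    ## P(R and DM) = 0.06
confidence <- support / (n.r / n)        ## P(DM | R)   = 0.75
lift <- confidence / (n.dm / n)          ## 0.75 / 0.1  = 7.5
c(support = support, confidence = confidence, lift = lift)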
Association Rule Mining

I Association Rule Mining is normally composed of two steps:


I Finding all frequent itemsets whose supports are no less than a
minimum support threshold;
I From the above frequent itemsets, generating association rules
with confidence above a minimum confidence threshold.
I The second step is straightforward, but the first one, frequent
itemset generation, is computationally intensive.
I The number of possible itemsets is 2^n − 1, where n is the
number of unique items.
I Algorithms: Apriori, ECLAT, FP-Growth

6 / 68
Downward-Closure Property

I Downward-closure property of support, a.k.a.


anti-monotonicity
I For a frequent itemset, all its subsets are also frequent:
if {A,B} is frequent, then both {A} and {B} are frequent.
I For an infrequent itemset, all its super-sets are infrequent:
if {A} is infrequent, then {A,B}, {A,C} and {A,B,C} are
infrequent.
I Useful to prune candidate itemsets

7 / 68
Itemset Lattice

(Figure: itemset lattice, with frequent itemsets separated from infrequent ones.)

8 / 68
Apriori

I Apriori [Agrawal and Srikant, 1994]: a classic algorithm for


association rule mining
I A level-wise, breadth-first algorithm
I Counts transactions to find frequent itemsets
I Generates candidate itemsets by exploiting downward closure
property of support

9 / 68
Apriori Process

1. Find all frequent 1-itemsets L_1
2. Join step: generate candidate k-itemsets by joining L_{k-1} with
itself
3. Prune step: prune candidate k-itemsets using the
downward-closure property
4. Scan the dataset to count the frequency of candidate k-itemsets
and select the frequent k-itemsets L_k
5. Repeat the above process until no more frequent itemsets can
be found. (A toy sketch of this level-wise search follows below.)

10 / 68
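The level-wise search can be illustrated with a short, self-contained R sketch over a toy transaction list (illustrative only; the join and prune steps are simplified, and this is not the implementation used later in these slides):

## toy transactions and minimum support count
transactions <- list(c("bread", "butter", "milk"),
                     c("bread", "butter"),
                     c("bread", "milk"),
                     c("butter", "milk"))
minsup <- 2

## support count of an itemset = number of transactions containing all its items
supp.count <- function(items) {
  sum(sapply(transactions, function(t) all(items %in% t)))
}

## step 1: frequent 1-itemsets L_1
Lk <- Filter(function(x) supp.count(x) >= minsup,
             as.list(sort(unique(unlist(transactions)))))
frequent <- Lk

## steps 2-5: join, count, keep frequent k-itemsets, repeat
while (length(Lk) > 1) {
  k <- length(Lk[[1]]) + 1
  ## join step: union pairs of frequent (k-1)-itemsets
  cand <- do.call(c, lapply(seq_along(Lk), function(i)
    lapply(Lk[-seq_len(i)], function(y) sort(unique(c(Lk[[i]], y))))))
  cand <- unique(Filter(function(x) length(x) == k, cand))
  ## count candidates and keep the frequent k-itemsets L_k
  Lk <- Filter(function(x) supp.count(x) >= minsup, cand)
  frequent <- c(frequent, Lk)
}
sapply(frequent, paste, collapse = ",")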
From [?]

11 / 68
FP-growth

I FP-growth: frequent-pattern growth, which mines frequent


itemsets without candidate generation [Han et al., 2004]
I Compresses the input database creating an FP-tree instance
to represent frequent items.
I Divides the compressed database into a set of conditional
databases, each one associated with one frequent pattern.
I Each such database is mined separately.
I It reduces search costs by looking for short patterns recursively
and then concatenating them into long frequent patterns.†


† https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm
12 / 68
FP-tree
I The frequent-pattern tree (FP-tree) is a compact structure
that stores quantitative information about frequent patterns in
a dataset. It has two components:
I A root labeled as “null” with a set of item-prefix subtrees as
children
I A frequent-item header table
I Each node has three attributes:
I Item name
I Count: number of transactions represented by the path from
root to the node
I Node link: links to the next node having the same item name
I Each entry in the frequent-item header table also has three
attributes:
I Item name
I Head of node link: points to the first node in the FP-tree
having the same item name
I Count: frequency of the item

13 / 68
FP-tree

From [Han, 2005]

14 / 68
The FP-growth Algorithm

I In the first pass, the algorithm counts the occurrences of items
(attribute-value pairs) in the dataset and stores them in a
header table.
I In the second pass, it builds the FP-tree structure by inserting
instances.
I Items in each instance have to be sorted in descending order of
their frequency in the dataset, so that the tree can be
processed quickly.
I Items in each instance that do not meet the minimum coverage
threshold are discarded.
I If many instances share their most frequent items, the FP-tree
provides high compression close to the tree root. (A small
sketch of these two passes follows below.)

15 / 68
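A minimal R sketch of these two passes (illustrative only; not the arules or Borgelt implementation): count item occurrences, keep the frequent ones as the header table, then reorder each transaction by descending item frequency.

## toy transactions (items as single letters) and minimum support count
transactions <- list(c("f", "a", "c", "d", "g", "i", "m", "p"),
                     c("a", "b", "c", "f", "l", "m", "o"),
                     c("b", "f", "h", "j", "o"),
                     c("b", "c", "k", "s", "p"),
                     c("a", "f", "c", "e", "l", "p", "m", "n"))
min.count <- 3

## pass 1: count item occurrences; frequent items form the header table
item.counts <- sort(table(unlist(transactions)), decreasing = TRUE)
header <- item.counts[item.counts >= min.count]
header

## pass 2: drop infrequent items and reorder each transaction by descending
## item frequency, so that shared prefixes overlap near the FP-tree root
ordered <- lapply(transactions, function(t) {
  t <- t[t %in% names(header)]
  t[order(match(t, names(header)))]
})
ordered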
The FP-growth Algorithm
I Recursive processing of this compressed version of the main
dataset grows large itemsets directly, instead of generating
candidate itemsets and testing them against the entire database.
I Growth starts from the bottom of the header table (having the
longest branches), by finding all instances matching the given
condition.
I A new tree is created, with counts projected from the original
tree, corresponding to the set of instances that are conditional
on the attribute, with each node getting the sum of its
children's counts.
I Recursive growth ends when no individual items conditional
on the attribute meet minimum support threshold, and
processing continues on the remaining header items of the
original FP-tree.
I Once the recursive process has completed, all large item sets
with minimum coverage have been found, and association rule
creation begins.
16 / 68
ECLAT

I ECLAT: equivalence class transformation [Zaki et al., 1997]


I A depth-first search algorithm using set intersection
I Idea: use tid (transaction ID) set intersection to compute the
support of a candidate itemset, avoiding the generation of
subsets that do not exist in the prefix tree.
I t(AB) = t(A) ∩ t(B), where t(A) is the set of IDs of
transactions containing A.
I support(AB) = |t(AB)|
I Eclat intersects the tidsets only if the frequent itemsets share
a common prefix.
I It traverses the prefix search tree in a depth-first manner,
processing a group of itemsets that have the same prefix, also
called a prefix equivalence class. (A small tidset sketch follows
below.)

17 / 68
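A small illustrative sketch of the tidset idea (not the arules eclat() implementation): map each item to the IDs of the transactions containing it, and intersect tidsets to obtain supports.

## toy vertical (tidset) representation
transactions <- list(`1` = c("A", "B", "C"),
                     `2` = c("A", "C"),
                     `3` = c("A", "D"),
                     `4` = c("B", "E"))

## t(X): IDs of transactions containing item X
tidsets <- tapply(rep(names(transactions), lengths(transactions)),
                  unlist(transactions), c)

## support(AC) = |t(A) ∩ t(C)|
t.AC <- intersect(tidsets[["A"]], tidsets[["C"]])
length(t.AC)  ## 2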
ECLAT

I It works recursively.
I The initial call uses all single items with their tid-sets.
I In each recursive call, it verifies each itemset-tidset pair
(X, t(X)) against all the other pairs to generate new candidates.
If a new candidate is frequent, it is added to the set P_X.
I Recursively, it finds all frequent itemsets in the X branch.

18 / 68
ECLAT

From [?]
19 / 68
Interestingness Measures

I Which rules or patterns are interesting (and useful)?


I Two types of rule interestingness measures: subjective and
objective [Freitas, 1998, Silberschatz and Tuzhilin, 1996].
I Objective measures, such as lift, odds ratio and conviction,
are often data-driven and give the interestingness in terms of
statistics or information theory.
I Subjective (user-driven) measures, such as unexpectedness
and actionability, focus on finding interesting patterns by
matching against a given set of user beliefs.

20 / 68
Objective Interestingness Measures
I Support, confidence and lift are the most widely used
objective measures to select interesting rules.
I Many other objective measures were introduced by Tan et al.
[Tan et al., 2002], such as φ-coefficient, odds ratio, kappa,
mutual information, J-measure, Gini index, Laplace,
conviction, interest and cosine (see the sketch after this list).
I Different measures have different intrinsic properties and there
is no measure that is better than others in all application
domains.
I In addition, any-confidence, all-confidence and bond were
designed by Omiecinski [Omiecinski, 2003].
I Utility is used by Chan et al. [Chan et al., 2003] to find top-k
objective-directed rules.
I Unexpected Confidence Interestingness and Isolated
Interestingness were designed by Dong and Li
[Dong and Li, 1998], considering a rule's unexpectedness in
terms of other association rules in its neighbourhood.
21 / 68
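As an illustrative note (not from the original slides): with the arules package used later in this deck, additional objective measures can be computed for already-mined rules with interestMeasure(). The sketch below assumes the Groceries example data shipped with arules and the measure names "oddsRatio", "conviction" and "cosine"; check the package documentation for the exact set of supported measures.

library(arules)
data("Groceries")  ## example transaction data shipped with arules

## mine a small set of rules, then attach extra objective measures
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5),
                 control = list(verbose = FALSE))
extra <- interestMeasure(rules,
                         measure = c("oddsRatio", "conviction", "cosine"),
                         transactions = Groceries)
head(cbind(quality(rules), round(extra, 3)))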
Subjective Interestingness Measures

I A pattern is unexpected if it is new to a user or contradicts


the user’s experience or domain knowledge.
I A pattern is actionable if the user can do something with it to
his/her advantage [Silberschatz and Tuzhilin, 1995].
I Liu and Hsu [Liu and Hsu, 1996] proposed to rank learned
rules by matching against expected patterns provided by the
user.
I Ras and Wieczorkowska [Ras and Wieczorkowska, 2000]
designed action-rules which show “what actions should be
taken to improve the profitability of customers”. The
attributes are grouped into “hard attributes” which cannot be
changed and “soft attributes” which are possible to change
with reasonable costs. The status of customers can be moved
from one group to another by changing the values of the soft
attributes.

22 / 68
Interestingness Measures - I

From [Tan et al., 2002]

23 / 68
Interestingness Measures - II

From [Tan et al., 2002]

24 / 68
Applications
I Market basket analysis
I Identifying associations between items in shopping baskets,
i.e., which items are frequently purchased together
I Can be used by retailers to understand customer shopping
habits, do selective marketing and plan shelf space
I Churn analysis and selective marketing
I Discovering demographic characteristics and behaviours of
customers who are likely/unlikely to switch to other telcos
I Identifying customer groups who are likely to purchase a new
service or product
I Credit card risk analysis
I Finding characteristics of customers who are likely to default
on credit card or mortgage
I Can be used by banks to reduce risks when assessing new
credit card or mortgage applications

25 / 68
Applications (cont.)

I Stock market analysis


I Finding relationships between individual stocks, or between
stocks and economic factors
I Can help stock traders select interesting stocks and improve
trading strategies
I Medical diagnosis
I Identifying relationships between symptoms, test results and
illness
I Can be used to assist doctors in illness diagnosis or even in
treatment

26 / 68
Contents
Association Rules: Concept and Algorithms
Basics of Association Rules
Algorithms: Apriori, ECLAT and FP-growth
Interestingness Measures
Applications

Association Rule Mining with R


Mining Association Rules
Removing Redundancy
Interpreting Rules
Visualizing Association Rules
Wrap Up

Further Readings and Online Resources

Exercise

27 / 68
Association Rule Mining Algorithms in R

I Apriori [Agrawal and Srikant, 1994]


I A level-wise, breadth-first algorithm which counts transactions
to find frequent itemsets and then derive association rules from
them
I apriori() in package arules

I ECLAT [Zaki et al., 1997]


I Finds frequent itemsets with equivalence classes, depth-first
search and set intersection instead of counting
I eclat() in package arules

28 / 68
The Titanic Dataset

I The Titanic dataset in the datasets package is a 4-dimensional


table with summarized information on the fate of passengers
on the Titanic according to social class, sex, age and survival.
I To make it suitable for association rule mining, we reconstruct
the raw data as titanic.raw, where each row represents a
person (a sketch of one possible reconstruction is shown below).
I The reconstructed raw data can also be downloaded at
http://www.rdatamining.com/data/titanic.raw.rdata.

29 / 68
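If the .rdata file is not available, a similar per-person data frame can be reconstructed directly from the built-in Titanic table (a sketch; the downloadable titanic.raw used below may differ slightly in row order and factor level order):

## expand the 4-dimensional contingency table into one row per person
df <- as.data.frame(datasets::Titanic)
titanic.raw <- df[rep(seq_len(nrow(df)), times = df$Freq), 1:4]
rownames(titanic.raw) <- NULL
dim(titanic.raw)  ## 2201 4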
Pipe Operations in R

I Load library magrittr for pipe operations


I Avoid nested function calls
I Make code easy to understand
I Supported by dplyr and ggplot2

library(magrittr) ## for pipe operations


## traditional way
b <- fun3(fun2(fun1(a), p2))
## the above can be rewritten to
b <- a %>% fun1() %>% fun2(p2) %>% fun3()

30 / 68
## download data
download.file(url="http://www.rdatamining.com/data/titanic.raw.rdata",
destfile="./data/titanic.raw.rdata")

library(magrittr) ## for pipe operations


## load data, and the name of the R object is titanic.raw
load("../data/titanic.raw.rdata")
## dimensionality
titanic.raw %>% dim()
## [1] 2201 4

## structure of data
titanic.raw %>% str()
## 'data.frame': 2201 obs. of 4 variables:
## $ Class : Factor w/ 4 levels "1st","2nd","3rd",..: 3 3 3...
## $ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 ...
## $ Age : Factor w/ 2 levels "Adult","Child": 2 2 2 2 2 ...
## $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1...

31 / 68
## draw a random sample of 5 records
idx <- 1:nrow(titanic.raw) %>% sample(5)
titanic.raw[idx, ]
## Class Sex Age Survived
## 2080 2nd Female Adult Yes
## 1162 Crew Male Adult No
## 954 Crew Male Adult No
## 2172 3rd Female Adult Yes
## 456 3rd Male Adult No

## a summary of the dataset


titanic.raw %>% summary()
## Class Sex Age Survived
## 1st :325 Female: 470 Adult:2092 No :1490
## 2nd :285 Male :1731 Child: 109 Yes: 711
## 3rd :706
## Crew:885

32 / 68
Function apriori()

I Mine frequent itemsets, association rules or association


hyperedges using the Apriori algorithm.
I The Apriori algorithm employs level-wise search for frequent
itemsets.
I Default settings:
I minimum support: supp=0.1
I minimum confidence: conf=0.8
I maximum length of rules: maxlen=10

33 / 68
## mine association rules
library(arules) ## load required library
rules.all <- titanic.raw %>% apriori() ## run the APRIORI algorithm
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime
## 0.8 0.1 1 none FALSE TRUE 5
## support minlen maxlen target ext
## 0.1 1 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 220
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[10 item(s), 2201 transaction(s)] done ...
## sorting and recoding items ... [9 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [27 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
34 / 68
rules.all %>% length() ## number of rules discovered
## [1] 27

rules.all %>% inspect() ## print all rules


## lhs rhs support confidence ...
## [1] {} => {Age=Adult} 0.9504771 0.9504771 1...
## [2] {Class=2nd} => {Age=Adult} 0.1185825 0.9157895 0...
## [3] {Class=1st} => {Age=Adult} 0.1449341 0.9815385 1...
## [4] {Sex=Female} => {Age=Adult} 0.1930940 0.9042553 0...
## [5] {Class=3rd} => {Age=Adult} 0.2848705 0.8881020 0...
## [6] {Survived=Yes} => {Age=Adult} 0.2971377 0.9198312 0...
## [7] {Class=Crew} => {Sex=Male} 0.3916402 0.9740113 1...
## [8] {Class=Crew} => {Age=Adult} 0.4020900 1.0000000 1...
## [9] {Survived=No} => {Sex=Male} 0.6197183 0.9154362 1...
## [10] {Survived=No} => {Age=Adult} 0.6533394 0.9651007 1...
## [11] {Sex=Male} => {Age=Adult} 0.7573830 0.9630272 1...
## [12] {Sex=Female, ...
## Survived=Yes} => {Age=Adult} 0.1435711 0.9186047 0...
## [13] {Class=3rd, ...
## Sex=Male} => {Survived=No} 0.1917310 0.8274510 1...
## [14] {Class=3rd, ...
## Survived=No} => {Age=Adult} 0.2162653 0.9015152 0...
## [15] {Class=3rd, ...
35 / 68
I Suppose we want to find patterns of survival and non-survival
I verbose=F: suppress progress report
I minlen=2: find rules that contain at least two items
I Use lower thresholds for support and confidence
I rhs=c(...): find rules whose right-hand sides are in the list
I default="lhs": use default setting for left-hand side
I quality(...): interestingness measures

## run APRIORI again to find rules with rhs containing "Survived" only
rules.surv <- titanic.raw %>% apriori(
control = list(verbose=F),
parameter = list(minlen=2, supp=0.005, conf=0.8),
appearance = list(rhs=c("Survived=No",
"Survived=Yes"),
default="lhs"))
## keep three decimal places
quality(rules.surv) <- rules.surv %>% quality() %>% round(digits=3)
## sort rules by lift
rules.surv.sorted <- rules.surv %>% sort(by="lift")

36 / 68
rules.surv.sorted %>% inspect() ## print rules
## lhs rhs support confidence lif...
## [1] {Class=2nd, ...
## Age=Child} => {Survived=Yes} 0.011 1.000 3.09...
## [2] {Class=2nd, ...
## Sex=Female, ...
## Age=Child} => {Survived=Yes} 0.006 1.000 3.09...
## [3] {Class=1st, ...
## Sex=Female} => {Survived=Yes} 0.064 0.972 3.01...
## [4] {Class=1st, ...
## Sex=Female, ...
## Age=Adult} => {Survived=Yes} 0.064 0.972 3.01...
## [5] {Class=2nd, ...
## Sex=Female} => {Survived=Yes} 0.042 0.877 2.71...
## [6] {Class=Crew, ...
## Sex=Female} => {Survived=Yes} 0.009 0.870 2.69...
## [7] {Class=Crew, ...
## Sex=Female, ...
## Age=Adult} => {Survived=Yes} 0.009 0.870 2.69...
## [8] {Class=2nd, ...
## Sex=Female, ...
## Age=Adult} => {Survived=Yes} 0.036 0.860 2.66...
## [9] {Class=2nd, ...
37 / 68
Redundant Rules

I There are often too many association rules discovered from a


dataset.
I It is necessary to remove redundant rules before a user is able
to study the rules and identify interesting ones from them.

38 / 68
Redundant Rules
## redundant rules
rules.surv.sorted[1:2] %>% inspect()
## lhs rhs support confidence lift...
## [1] {Class=2nd, ...
## Age=Child} => {Survived=Yes} 0.011 1 3.096...
## [2] {Class=2nd, ...
## Sex=Female, ...
## Age=Child} => {Survived=Yes} 0.006 1 3.096...

I Rule #2 provides no extra knowledge in addition to rule #1,
since rule #1 already tells us that all 2nd-class children survived.
I When a rule (such as #2) is a super rule of another rule (#1)
and the former has the same or a lower lift, the former rule
(#2) is considered to be redundant.
I Other redundant rules in the above result are rules #4, #7
and #8, compared respectively with #3, #6 and #5.
39 / 68
Remove Redundant Rules
## find redundant rules
subset.matrix <- is.subset(rules.surv.sorted, rules.surv.sorted)
subset.matrix[lower.tri(subset.matrix, diag = T)] <- F
redundant <- colSums(subset.matrix) >= 1

## which rules are redundant


redundant %>% which()
## {Class=2nd,Sex=Female,Age=Child,Survived=Yes}
## 2
## {Class=1st,Sex=Female,Age=Adult,Survived=Yes}
## 4
## {Class=Crew,Sex=Female,Age=Adult,Survived=Yes}
## 7
## {Class=2nd,Sex=Female,Age=Adult,Survived=Yes}
## 8

## remove redundant rules


rules.surv.pruned <- rules.surv.sorted[!redundant]

40 / 68
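Note (an addition to the slides): recent versions of arules also provide is.redundant(), which should give the same or a very similar pruning in one step:

## alternative: let arules identify redundant rules directly
rules.surv.pruned <- rules.surv.sorted[!is.redundant(rules.surv.sorted)]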
Remaining Rules
rules.surv.pruned %>% inspect() ## print rules
## lhs rhs support confidence lift...
## [1] {Class=2nd, ...
## Age=Child} => {Survived=Yes} 0.011 1.000 3.096...
## [2] {Class=1st, ...
## Sex=Female} => {Survived=Yes} 0.064 0.972 3.010...
## [3] {Class=2nd, ...
## Sex=Female} => {Survived=Yes} 0.042 0.877 2.716...
## [4] {Class=Crew, ...
## Sex=Female} => {Survived=Yes} 0.009 0.870 2.692...
## [5] {Class=2nd, ...
## Sex=Male, ...
## Age=Adult} => {Survived=No} 0.070 0.917 1.354...
## [6] {Class=2nd, ...
## Sex=Male} => {Survived=No} 0.070 0.860 1.271...
## [7] {Class=3rd, ...
## Sex=Male, ...
## Age=Adult} => {Survived=No} 0.176 0.838 1.237...
## [8] {Class=3rd, ...
## Sex=Male} => {Survived=No} 0.192 0.827 1.222...
41 / 68
rules.surv.pruned[1] %>% inspect() ## print rules
## lhs rhs support confidence
## [1] {Class=2nd,Age=Child} => {Survived=Yes} 0.011 1
## lift count
## [1] 3.096 24

I Did children have a higher survival rate than adults?


I Did children of the 2nd class have a higher survival rate than
other children?
I The rule states only that all children of class 2 survived, but
provides no information at all about the survival rates of other
classes.

42 / 68
Find Rules about Age Groups
I Use lower thresholds to find all rules for children of different
classes
I verbose=F: suppress progress report
I minlen=3: find rules that contain at least three items
I Use lower thresholds for support and confidence
I lhs=c(...), rhs=c(...): find rules whose left/right-hand
sides are in the lists
I quality(...): interestingness measures

## mine rules about class and age group


rules.age <- titanic.raw %>% apriori(control = list(verbose=F),
parameter = list(minlen=3, supp=0.002, conf=0.2),
appearance = list(default="none", rhs=c("Survived=Yes"),
lhs=c("Class=1st", "Class=2nd", "Class=3rd",
"Age=Child", "Age=Adult")))
rules.age <- sort(rules.age, by="confidence")

43 / 68
Rules about Age Groups
rules.age %>% inspect() ## print rules
## lhs rhs support
## [1] {Class=2nd,Age=Child} => {Survived=Yes} 0.010904134
## [2] {Class=1st,Age=Child} => {Survived=Yes} 0.002726034
## [3] {Class=1st,Age=Adult} => {Survived=Yes} 0.089504771
## [4] {Class=2nd,Age=Adult} => {Survived=Yes} 0.042707860
## [5] {Class=3rd,Age=Child} => {Survived=Yes} 0.012267151
## [6] {Class=3rd,Age=Adult} => {Survived=Yes} 0.068605179
## confidence lift count
## [1] 1.0000000 3.0956399 24
## [2] 1.0000000 3.0956399 6
## [3] 0.6175549 1.9117275 197
## [4] 0.3601533 1.1149048 94
## [5] 0.3417722 1.0580035 27
## [6] 0.2408293 0.7455209 151

## average survival rate


titanic.raw$Survived %>% table() %>% prop.table()
## .
## No Yes
## 0.676965 0.323035
44 / 68
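The age-group questions raised on the previous slides can also be checked directly on the raw data; a quick sketch (illustrative, not from the original slides):

## survival rate by age group
titanic.raw %>% with(table(Age, Survived)) %>% prop.table(margin = 1)

## survival rate of children, by class (classes with no children are dropped)
titanic.raw %>% subset(Age == "Child") %>% droplevels() %>%
  with(table(Class, Survived)) %>% prop.table(margin = 1)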
## rule visualisation
library(arulesViz)
rules.all %>% plot()

(Figure: scatter plot for 27 rules)

I X-axis: support
I Y-axis: confidence
I Color: lift

45 / 68
rules.surv %>% plot(method = "grouped")

(Figure: grouped matrix for 12 rules; one axis shows LHS item groups such as
{Age=Child, Class=2nd, +1 items}, {Class=1st, Sex=Female, +1 items},
{Class=2nd, Sex=Female}, {Class=Crew, Sex=Female, +1 items},
{Class=3rd, Sex=Male}; the other axis shows the RHS items {Survived=Yes}
and {Survived=No}; size: support, color: lift)
46 / 68
rules.surv %>% plot(method="graph",
control=list(layout=igraph::with_fr()))

(Figure: graph for 12 rules;
size: support (0.006 − 0.192), color: lift (1.222 − 3.096))
47 / 68
rules.surv %>% plot(method="graph",
control=list(layout=igraph::in_circle()))

(Figure: graph for 12 rules, circular layout;
size: support (0.006 − 0.192), color: lift (1.222 − 3.096))

48 / 68
rules.surv %>% plot(method="paracoord",
control=list(reorder=T))

(Figure: parallel coordinates plot for 12 rules; x-axis: positions 3, 2, 1
and rhs, y-axis: the items)

49 / 68
Interactive Plots and Reorder rules

rules.all %>% plot(interactive = T)

interactive = TRUE
I Selecting and inspecting one or multiple rules
I Zooming
I Filtering rules with an interesting measure

rules.surv %>% plot(method = "paracoord", control = list(reorder = T))

reorder = TRUE
I To improve visualisation by reordering rules and minimizing
crossovers
I The visualisation is likely to change from run to run.

50 / 68
Wrap Up

I Starting with a high support, to get a small set of rules quickly


I Setting constraints to left and/or right hand side of rules, to
focus on rules that you are interested in
I Digging down into the data to find more associations with lower
thresholds of support and confidence
I Rules of low confidence / lift can be interesting and useful.
I Be cautious when interpreting rules

51 / 68
Contents
Association Rules: Concept and Algorithms
Basics of Association Rules
Algorithms: Apriori, ECLAT and FP-growth
Interestingness Measures
Applications

Association Rule Mining with R


Mining Association Rules
Removing Redundancy
Interpreting Rules
Visualizing Association Rules
Wrap Up

Further Readings and Online Resources

Exercise

52 / 68
Further Readings
I Association Rule Learning
https://en.wikipedia.org/wiki/Association_rule_learning
I Data Mining Algorithms In R: Apriori
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_Apriori_Algorithm
I Data Mining Algorithms In R: ECLAT
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_Eclat_Algorithm
I Data Mining Algorithms In R: FP-Growth
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Frequent_Pattern_Mining/The_FP-Growth_Algorithm
I FP-Growth Implementation by Christian Borgelt
http://www.borgelt.net/fpgrowth.html
I Frequent Itemset Mining Implementations Repository
http://fimi.ua.ac.be/data/

53 / 68
Further Readings
I More than 20 interestingness measures, such as chi-square,
conviction, gini and leverage
Tan, P.-N., Kumar, V., and Srivastava, J. (2002). Selecting the right
interestingness measure for association patterns. In Proc. of KDD ’02,
pages 32-41, New York, NY, USA. ACM Press.
I More reviews on interestingness measures:
[Silberschatz and Tuzhilin, 1996], [Tan et al., 2002] and
[Omiecinski, 2003]
I Post mining of association rules, such as selecting interesting
association rules, visualization of association rules and using
association rules for classification [Zhao et al., 2009]
Yanchang Zhao, et al. (Eds.). “Post-Mining of Association Rules:
Techniques for Effective Knowledge Extraction”, ISBN
978-1-60566-404-0, May 2009. Information Science Reference.
I Package arulesSequences: mining sequential patterns
http://cran.r-project.org/web/packages/arulesSequences/
54 / 68
Contents
Association Rules: Concept and Algorithms
Basics of Association Rules
Algorithms: Apriori, ECLAT and FP-growth
Interestingness Measures
Applications

Association Rule Mining with R


Mining Association Rules
Removing Redundancy
Interpreting Rules
Visualizing Association Rules
Wrap Up

Further Readings and Online Resources

Exercise

55 / 68
The Mushroom Dataset I
I The mushroom dataset includes descriptions of hypothetical
samples corresponding to 23 species of gilled mushrooms ‡ .
I A csv file with 8,124 observations on 23 categorical variables:
1. class: edible=e, poisonous=p
2. cap-shape: bell=b,conical=c,convex=x,flat=f,
knobbed=k,sunken=s
3. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s
4. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r,
pink=p,purple=u,red=e,white=w,yellow=y
5. bruises?: bruises=t,no=f
6. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f,
musty=m,none=n,pungent=p,spicy=s
7. gill-attachment: attached=a,descending=d,free=f,notched=n
8. gill-spacing: close=c,crowded=w,distant=d
9. gill-size: broad=b,narrow=n
10. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g,
green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y

56 / 68
The Mushroom Dataset II

11. stalk-shape: enlarging=e,tapering=t


12. stalk-root: bulbous=b,club=c,cup=u,equal=e,
rhizomorphs=z,rooted=r,missing=?
13. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s
14. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s
15. stalk-color-above-ring:
brown=n,buff=b,cinnamon=c,gray=g,orange=o,
pink=p,red=e,white=w,yellow=y
16. stalk-color-below-ring:
brown=n,buff=b,cinnamon=c,gray=g,orange=o,
pink=p,red=e,white=w,yellow=y
17. veil-type: partial=p,universal=u
18. veil-color: brown=n,orange=o,white=w,yellow=y
19. ring-number: none=n,one=o,two=t
20. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l,
none=n,pendant=p,sheathing=s,zone=z

57 / 68
The Mushroom Dataset III

21. spore-print-color:
black=k,brown=n,buff=b,chocolate=h,green=r,
orange=o,purple=u,white=w,yellow=y
22. population: abundant=a,clustered=c,numerous=n,
scattered=s,several=v,solitary=y
23. habitat: grasses=g,leaves=l,meadows=m,paths=p,
urban=u,waste=w,woods=d


‡ https://archive.ics.uci.edu/ml/datasets/Mushroom
58 / 68
Load Mushroom Dataset

## load the mushroom data from the UCI Machine Learning Repository

url <- paste0("http://archive.ics.uci.edu/ml/",
              "machine-learning-databases/mushroom/agaricus-lepiota.data")

mushrooms <- read.csv(file = url, header = FALSE)


names(mushrooms) <- c("class", "cap-shape", "cap-surface",
"cap-color", "bruises", "odor", "gill-attachment", "gill-spacing",
"gill-size", "gill-color", "stalk-shape", "stalk-root",
"stalk-surface-above-ring", "stalk-surface-below-ring",
"stalk-color-above-ring", "stalk-color-below-ring",
"veil-type", "veil-color", "ring-number", "ring-type",
"spore-print-color", "population", "habitat")
table(mushrooms$class, useNA="ifany")
##
## e p
## 4208 3916

59 / 68
The Mushroom Dataset
str(mushrooms)
## 'data.frame': 8124 obs. of 23 variables:
## $ class : Factor w/ 2 levels "e","p": 2 ...
## $ cap-shape : Factor w/ 6 levels "b","c","f"...
## $ cap-surface : Factor w/ 4 levels "f","g","s"...
## $ cap-color : Factor w/ 10 levels "b","c","e...
## $ bruises : Factor w/ 2 levels "f","t": 2 ...
## $ odor : Factor w/ 9 levels "a","c","f"...
## $ gill-attachment : Factor w/ 2 levels "a","f": 2 ...
## $ gill-spacing : Factor w/ 2 levels "c","w": 1 ...
## $ gill-size : Factor w/ 2 levels "b","n": 2 ...
## $ gill-color : Factor w/ 12 levels "b","e","g...
## $ stalk-shape : Factor w/ 2 levels "e","t": 1 ...
## $ stalk-root : Factor w/ 5 levels "?","b","c"...
## $ stalk-surface-above-ring: Factor w/ 4 levels "f","k","s"...
## $ stalk-surface-below-ring: Factor w/ 4 levels "f","k","s"...
## $ stalk-color-above-ring : Factor w/ 9 levels "b","c","e"...
## $ stalk-color-below-ring : Factor w/ 9 levels "b","c","e"...
## $ veil-type : Factor w/ 1 level "p": 1 1 1 1...
## $ veil-color : Factor w/ 4 levels "n","o","w"...
## $ ring-number : Factor w/ 3 levels "n","o","t"...
60 / 68
Exercise

I From the mushroom data, find association rules that can be


used to identify the edibility of a mushroom
I Think about parameters: length of rules, minimum support,
minimum confidence
I How to find only rules relevant to edibility?
I Which interestingness measures to use?
I Any redundant rules? How to remove them?
I What are characteristics of edible mushrooms? And
characteristics of poisonous ones?

61 / 68
Mining Association Rules from Mushroom Dataset
## find association rules from the mushroom dataset
rules <- apriori(mushrooms, control = list(verbose=F),
parameter = list(minlen=2, maxlen=5),
appearance = list(rhs=c("class=p", "class=e"),
default="lhs"))
quality(rules) <- round(quality(rules), digits=3)
rules.sorted <- sort(rules, by="confidence")
inspect(head(rules.sorted))
## lhs rhs support confidence
## [1] {ring-type=l} => {class=p} 0.160 1
## [2] {gill-color=b} => {class=p} 0.213 1
## [3] {odor=f} => {class=p} 0.266 1
## [4] {gill-size=b,gill-color=n} => {class=e} 0.108 1
## [5] {odor=n,stalk-root=e} => {class=e} 0.106 1
## [6] {bruises=f,stalk-root=e} => {class=e} 0.106 1
## lift count
## [1] 2.075 1296
## [2] 2.075 1728
## [3] 2.075 2160
## [4] 1.931 880
## [5] 1.931 864
62 / 68
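The mined rules will typically contain redundancy, as in the Titanic example; one possible next step for the exercise (a sketch, assuming a version of arules that provides is.redundant()):

## prune redundant rules before interpreting them
rules.pruned <- rules.sorted[!is.redundant(rules.sorted)]
length(rules.pruned)
inspect(head(rules.pruned, 3))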
Online Resources

I Book titled R and Data Mining: Examples and Case


Studies [Zhao, 2012]
http://www.rdatamining.com/docs/RDataMining-book.pdf
I R Reference Card for Data Mining
http://www.rdatamining.com/docs/RDataMining-reference-card.pdf
I Free online courses and documents
http://www.rdatamining.com/resources/
I RDataMining Group on LinkedIn (27,000+ members)
http://group.rdatamining.com
I Twitter (3,300+ followers)
@RDataMining

63 / 68
The End

Thanks!
Email: yanchang(at)RDataMining.com
Twitter: @RDataMining
64 / 68
How to Cite This Work

I Citation
Yanchang Zhao. R and Data Mining: Examples and Case Studies. ISBN
978-0-12-396963-7, December 2012. Academic Press, Elsevier. 256
pages. URL: http://www.rdatamining.com/docs/RDataMining-book.pdf.
I BibTex
@BOOK{Zhao2012R,
title = {R and Data Mining: Examples and Case Studies},
publisher = {Academic Press, Elsevier},
year = {2012},
author = {Yanchang Zhao},
pages = {256},
month = {December},
isbn = {978-0-123-96963-7},
keywords = {R, data mining},
url = {http://www.rdatamining.com/docs/RDataMining-book.pdf}
}

65 / 68
References I
Agrawal, R., Imielinski, T., and Swami, A. (1993).
Mining association rules between sets of items in large databases.
In Proc. of the ACM SIGMOD International Conference on Management of Data, pages 207–216,
Washington D.C. USA.

Agrawal, R. and Srikant, R. (1994).


Fast algorithms for mining association rules in large databases.
In Proc. of the 20th International Conference on Very Large Data Bases, pages 487–499, Santiago, Chile.

Chan, R., Yang, Q., and Shen, Y.-D. (2003).


Mining high utility itemsets.
In Data Mining, 2003. ICDM 2003. Third IEEE International Conference on, pages 19–26.

Dong, G. and Li, J. (1998).


Interestingness of discovered association rules in terms of neighborhood-based unexpectedness.
In PAKDD ’98: Proceedings of the Second Pacific-Asia Conference on Research and Development in
Knowledge Discovery and Data Mining, pages 72–86, London, UK. Springer-Verlag.

Freitas, A. A. (1998).
On objective measures of rule surprisingness.
In PKDD ’98: Proceedings of the Second European Symposium on Principles of Data Mining and
Knowledge Discovery, pages 1–9, London, UK. Springer-Verlag.

Han, J. (2005).
Data Mining: Concepts and Techniques.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Han, J., Pei, J., Yin, Y., and Mao, R. (2004).


Mining frequent patterns without candidate generation.
Data Mining and Knowledge Discovery, 8:53–87.

66 / 68
References II
Liu, B. and Hsu, W. (1996).
Post-analysis of learned rules.
In Proceedings of the 13th National Conference on Artificial Intelligence (AAAI-96), pages 828–834,
Portland, Oregon, USA.

Omiecinski, E. R. (2003).
Alternative interest measures for mining associations in databases.
IEEE Transactions on Knowledge and Data Engineering, 15(1):57–69.

Ras, Z. W. and Wieczorkowska, A. (2000).


Action-rules: How to increase profit of a company.
In PKDD ’00: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge
Discovery, pages 587–592, London, UK. Springer-Verlag.

Silberschatz, A. and Tuzhilin, A. (1995).


On subjective measures of interestingness in knowledge discovery.
In Knowledge Discovery and Data Mining, pages 275–281.

Silberschatz, A. and Tuzhilin, A. (1996).


What makes patterns interesting in knowledge discovery systems.
IEEE Transactions on Knowledge and Data Engineering, 8(6):970–974.

Tan, P.-N., Kumar, V., and Srivastava, J. (2002).


Selecting the right interestingness measure for association patterns.
In KDD ’02: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, pages 32–41, New York, NY, USA. ACM Press.

Zaki, M. J., Parthasarathy, S., Ogihara, M., and Li, W. (1997).


New algorithms for fast discovery of association rules.
Technical Report 651, Computer Science Department, University of Rochester, Rochester, NY 14627.

67 / 68
References III

Zhao, Y. (2012).
R and Data Mining: Examples and Case Studies, ISBN 978-0-12-396963-7.
Academic Press, Elsevier.

Zhao, Y., Zhang, C., and Cao, L., editors (2009).


Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction, ISBN 978-1-60566-404-0.

Information Science Reference, Hershey, PA.

68 / 68
