Session5 6 (Am) PDF

The document discusses association rule mining and the Apriori algorithm. It provides examples to illustrate: 1) The Apriori algorithm generates frequent itemsets in multiple passes over the transaction data, starting with 1-itemsets and joining itemsets to generate candidate itemsets of increasing size. 2) In each pass, the algorithm counts the support for candidates and prunes any candidates with insufficient support before the next pass. 3) The algorithm uses the Apriori property that any subset of a frequent itemset must be frequent to prune the search space.

Uploaded by

gftr

Business Analytics

Today's Objective

ENHANCING DECISION MAKING


Association Analytics: A Mining Approach

Indian Institute of Management (IIM),Rohtak


Association Rule: A Concept of Mining
• Basic Concepts of Association Rule Mining
A 'rule' is something like this:
    If a basket contains Bread and Butter, then it also contains Milk.
Any such rule A -> B has two associated measures:
1. Confidence: when the 'if' part is true, how often is the 'then' part true? This is the same as the accuracy of the rule.
$$\text{confidence}(A \rightarrow B) = \frac{\#\,\text{tuples containing both } A \text{ and } B}{\#\,\text{tuples containing } A}$$
2. Coverage or support: how much of the database contains both the 'if' part and the 'then' part?
$$\text{support}(A \rightarrow B) = \frac{\#\,\text{tuples containing both } A \text{ and } B}{\text{total } \#\text{ of tuples}}$$
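The two measures can be checked by hand on a tiny example. The sketch below is only an illustration in base R: the five baskets and the helper name `count_containing` are invented here and do not come from the slides.

```r
# Hypothetical baskets, invented for illustration only.
baskets <- list(
  c("Bread", "Butter", "Milk"),
  c("Bread", "Butter"),
  c("Bread", "Milk"),
  c("Butter", "Milk"),
  c("Bread", "Butter", "Milk")
)

# Number of baskets that contain every item in `items`.
count_containing <- function(items, baskets) {
  sum(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("Bread", "Butter")  # the 'if' part (A)
rhs <- "Milk"                # the 'then' part (B)

both <- count_containing(c(lhs, rhs), baskets)        # baskets with A and B
support <- both / length(baskets)                     # share of all baskets
confidence <- both / count_containing(lhs, baskets)   # share of baskets with A

print(c(support = support, confidence = confidence))
```

Here two of the five baskets contain all three items and three contain Bread and Butter, so the rule has support 2/5 and confidence 2/3.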


Association Model: Problem Statement
• I = {i1, i2, ..., in}: a set of items

• J = P(I): the set of all subsets of the set of items; the elements of J are called itemsets

• Transaction T: T is a subset of I

• Database: a set of transactions

• An association rule is an implication of the form X -> Y, where X and Y are disjoint subsets of I (elements of J)

• Problem: find the rules whose support and confidence are at least the user-specified minimum support and minimum confidence

Solution: the Apriori method


The Apriori Procedure
The Apriori method is an influential method for mining frequent itemsets.

Key Concepts:
• Frequent itemsets: the itemsets that have at least the minimum support (Lk denotes the set of frequent k-itemsets).
• Join operation: to find Lk, a set of candidate k-itemsets, Ck, is generated by joining Lk-1 with itself.
• Apriori property: any subset of a frequent itemset must be frequent.

Understanding Apriori through an Example

TID    List of Items
T100   I1, I2, I5
T101   I2, I4
T102   I2, I3
T103   I1, I2, I4
T104   I1, I3
T105   I2, I3
T106   I1, I3
T107   I1, I2, I3, I5
T108   I1, I2, I3

• Consider a database, D, consisting of 9 transactions.
• Suppose the minimum support count required is 2 (i.e. min_sup = 2/9 ≈ 22%).
• Let the minimum confidence required be 70%.
• We first have to find the frequent itemsets using the Apriori algorithm.
• Then association rules are generated using the minimum support and minimum confidence.


Step 1: Generating 1-itemset Frequent Pattern

Scan D for the count of each candidate to obtain C1, then compare each candidate's support count with the minimum support count to obtain L1.

C1                       L1
Itemset  Sup.Count       Itemset  Sup.Count
{I1}     6               {I1}     6
{I2}     7               {I2}     7
{I3}     6               {I3}     6
{I4}     2               {I4}     2
{I5}     2               {I5}     2

• In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.
• The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support.

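The first pass can be reproduced with a few lines of base R. This is a minimal sketch, not the slides' own code: the variable names `D`, `min_count`, `C1` and `L1` simply mirror the notation of the example.

```r
# The nine transactions of the example database D.
D <- list(
  T100 = c("I1", "I2", "I5"),
  T101 = c("I2", "I4"),
  T102 = c("I2", "I3"),
  T103 = c("I1", "I2", "I4"),
  T104 = c("I1", "I3"),
  T105 = c("I2", "I3"),
  T106 = c("I1", "I3"),
  T107 = c("I1", "I2", "I3", "I5"),
  T108 = c("I1", "I2", "I3")
)
min_count <- 2  # minimum support count from the example

# C1: scan D once and count how many transactions contain each item.
C1 <- table(unlist(D, use.names = FALSE))

# L1: keep the candidates that reach the minimum support count.
L1 <- C1[C1 >= min_count]
print(L1)
```

Because every item occurs at least twice, L1 contains all five items, which matches the two tables of Step 1.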
Step 2: Generating 2-itemset Frequent Pattern
• To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2.
• Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the middle table on the next slide).
• The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support.
• Note: the Apriori property has not been used yet, because every 1-item subset of a candidate in C2 is already a member of L1.


Step 2: Generating 2-itemset Frequent Pattern [Cont.]

L1
Itemset  Sup.Count
{I1}     6
{I2}     7
{I3}     6
{I4}     2
{I5}     2

Generate the C2 candidates from L1, scan D for the count of each candidate, then compare each candidate's support count with the minimum support count.

C2 (candidates)    C2 (with counts)          L2
Itemset            Itemset    Sup.Count      Itemset    Sup.Count
{I1, I2}           {I1, I2}   4              {I1, I2}   4
{I1, I3}           {I1, I3}   4              {I1, I3}   4
{I1, I4}           {I1, I4}   1              {I1, I5}   2
{I1, I5}           {I1, I5}   2              {I2, I3}   4
{I2, I3}           {I2, I3}   4              {I2, I4}   2
{I2, I4}           {I2, I4}   2              {I2, I5}   2
{I2, I5}           {I2, I5}   2
{I3, I4}           {I3, I4}   0
{I3, I5}           {I3, I5}   1
{I4, I5}           {I4, I5}   0

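The second pass can be sketched in base R as follows. The helper `support_count` and the comma-separated labels are assumptions made for this illustration; `combn()` stands in for L1 Join L1, since joining 1-itemsets simply pairs every two items.

```r
# The nine transactions of the example database D.
D <- list(
  c("I1", "I2", "I5"), c("I2", "I4"), c("I2", "I3"),
  c("I1", "I2", "I4"), c("I1", "I3"), c("I2", "I3"),
  c("I1", "I3"), c("I1", "I2", "I3", "I5"), c("I1", "I2", "I3")
)
min_count <- 2

# Number of transactions that contain every item of `itemset`.
support_count <- function(itemset, D) {
  sum(vapply(D, function(t) all(itemset %in% t), logical(1)))
}

# L1 holds all five items here, so C2 is every pair of items.
L1 <- c("I1", "I2", "I3", "I4", "I5")
C2 <- combn(L1, 2, simplify = FALSE)

# Scan D for the support count of each candidate pair.
counts <- vapply(C2, support_count, integer(1), D = D)
names(counts) <- vapply(C2, paste, character(1), collapse = ",")

# L2: candidates that reach the minimum support count.
L2 <- counts[counts >= min_count]
print(L2)
```

Ten candidate pairs are counted and six survive, the same six itemsets listed under L2.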
Step 3: Generating 3-itemset Frequent Pattern

L2
Itemset    Sup.Count
{I1, I2}   4
{I1, I3}   4
{I1, I5}   2
{I2, I3}   4
{I2, I4}   2
{I2, I5}   2

• Generate the candidate set C3 from L2 (join step). Two itemsets in Lk-1 are joinable if their first k-2 items are the same, so here (k = 3) two itemsets in L2 are joinable if their first item matches.
• The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property.
• C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
• If we instead join every pair of itemsets in L2 that share any one item, one more candidate appears: C3 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I2, I4}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}.
• The join step is now complete, and the prune step is used to reduce the size of C3. The prune step helps to avoid the heavy computation caused by a large Ck.

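The join condition can be made concrete with a short base R sketch. The function name `join_step` is hypothetical, and the code assumes that the items inside each itemset and the itemsets themselves are kept in sorted order, as they are in the tables.

```r
# L2 from the example, each itemset sorted.
L2 <- list(
  c("I1", "I2"), c("I1", "I3"), c("I1", "I5"),
  c("I2", "I3"), c("I2", "I4"), c("I2", "I5")
)

# Join L_prev with itself: two itemsets of size n_prev are joined when
# their first n_prev - 1 items agree; the result has n_prev + 1 items.
join_step <- function(L_prev) {
  n_prev <- length(L_prev[[1]])
  prefix <- seq_len(n_prev - 1)
  out <- list()
  for (i in seq_along(L_prev)) {
    for (j in seq_along(L_prev)) {
      if (j <= i) next  # visit each unordered pair once
      a <- L_prev[[i]]
      b <- L_prev[[j]]
      if (identical(a[prefix], b[prefix])) {
        out[[length(out) + 1]] <- c(a, b[n_prev])
      }
    }
  }
  out
}

C3 <- join_step(L2)
print(vapply(C3, paste, character(1), collapse = ","))
```

The result has the six candidates of the prefix-based join; {I1, I2, I4} is absent because {I1, I2} and {I2, I4} do not share their first item.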
Step 3: Generating 3-itemset Frequent Pattern [Cont.]
• Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that four of the candidates cannot possibly be frequent. How?
• For example, take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we keep {I1, I2, I3} in C3.
• Take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}.
• But {I3, I5} is not a member of L2, so it is not frequent, and a frequent {I2, I3, I5} would violate the Apriori property. Thus we have to remove {I2, I3, I5} from C3.
• Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking every member of the join result for pruning.
• Now the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.
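The subset check behind the prune step can be sketched in base R. The names `C3_joined`, `key` and `all_subsets_frequent` are invented for this illustration; the data are the six joined candidates and L2 from the example.

```r
# Frequent 2-itemsets (L2) and the six joined candidates from the example.
L2 <- list(
  c("I1", "I2"), c("I1", "I3"), c("I1", "I5"),
  c("I2", "I3"), c("I2", "I4"), c("I2", "I5")
)
C3_joined <- list(
  c("I1", "I2", "I3"), c("I1", "I2", "I5"), c("I1", "I3", "I5"),
  c("I2", "I3", "I4"), c("I2", "I3", "I5"), c("I2", "I4", "I5")
)

# Text label for an itemset, used for membership tests.
key <- function(itemset) paste(itemset, collapse = ",")

# TRUE when every subset with one item fewer is in L_prev.
all_subsets_frequent <- function(cand, L_prev) {
  subsets <- combn(cand, length(cand) - 1, simplify = FALSE)
  all(vapply(subsets, key, character(1)) %in% vapply(L_prev, key, character(1)))
}

# Prune: keep only the candidates that pass the subset check.
C3 <- Filter(function(cand) all_subsets_frequent(cand, L2), C3_joined)
print(vapply(C3, key, character(1)))
```

Four candidates are dropped because one of {I3, I5}, {I3, I4} or {I4, I5} is missing from L2, which leaves {I1, I2, I3} and {I1, I2, I5}.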


Step 3: Generating 3-itemset Frequent Pattern [Cont.]

Scan D for the count of each candidate in C3, then compare each candidate's support count with the minimum support count.

C3              C3 (with counts)           L3
Itemset         Itemset        Sup.Count   Itemset        Sup.Count
{I1, I2, I3}    {I1, I2, I3}   2           {I1, I2, I3}   2
{I1, I2, I5}    {I1, I2, I5}   2           {I1, I2, I5}   2

Generate the candidate set C4 from L3 (join step). Two itemsets in Lk-1 are joinable (here k = 4) if they have their first k-2 items in common, so for L3 the first 2 items must match.


Step 4: Generating 4-itemset Frequent Pattern
• The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent.
• Thus C4 = φ, and the procedure terminates, having found all of the frequent itemsets. This completes the Apriori algorithm.
• What's next? These frequent itemsets will be used to generate strong association rules (strong association rules satisfy both minimum support and minimum confidence).
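Steps 1 to 4 can be put together in one level-wise loop. The following base R code is a compact sketch under the same assumptions as the walk-through (sorted itemsets, prefix-based join); the function names `apriori_gen` and `frequent_itemsets` are chosen for this illustration and are not the API of any package.

```r
D <- list(
  c("I1", "I2", "I5"), c("I2", "I4"), c("I2", "I3"),
  c("I1", "I2", "I4"), c("I1", "I3"), c("I2", "I3"),
  c("I1", "I3"), c("I1", "I2", "I3", "I5"), c("I1", "I2", "I3")
)

support_count <- function(itemset, D) {
  sum(vapply(D, function(t) all(itemset %in% t), logical(1)))
}

key <- function(itemset) paste(itemset, collapse = ",")

# Join step followed by prune step: build C_k from L_{k-1}.
apriori_gen <- function(L_prev) {
  n_prev <- length(L_prev[[1]])
  prefix <- seq_len(n_prev - 1)
  prev_keys <- vapply(L_prev, key, character(1))
  candidates <- list()
  for (i in seq_along(L_prev)) {
    for (j in seq_along(L_prev)) {
      a <- L_prev[[i]]
      b <- L_prev[[j]]
      if (j <= i || !identical(a[prefix], b[prefix])) next
      cand <- c(a, b[n_prev])
      subsets <- combn(cand, n_prev, simplify = FALSE)
      # Apriori property: keep cand only if all its subsets are frequent.
      if (all(vapply(subsets, key, character(1)) %in% prev_keys)) {
        candidates[[length(candidates) + 1]] <- cand
      }
    }
  }
  candidates
}

frequent_itemsets <- function(D, min_count) {
  is_frequent <- function(s) support_count(s, D) >= min_count
  items <- sort(unique(unlist(D, use.names = FALSE)))
  Lk <- Filter(is_frequent, as.list(items))   # L1
  found <- list()
  while (length(Lk) > 0) {
    found <- c(found, Lk)
    Lk <- Filter(is_frequent, apriori_gen(Lk))  # next level
  }
  found
}

L <- frequent_itemsets(D, min_count = 2)
print(vapply(L, key, character(1)))
```

On the example database the loop stops when C4 comes out empty and returns 13 frequent itemsets: five 1-itemsets, six 2-itemsets and two 3-itemsets.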


Step 5: Generating Association Rules from Frequent Itemsets
• Procedure:
• For each frequent itemset l, generate all nonempty proper subsets of l.
• For every such subset s of l, output the rule "s -> (l - s)" if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.

• Back to the example:
We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}.
– Take l = {I1,I2,I5}.
– Its nonempty proper subsets are {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.

• Let the confidence threshold be, say, 100%.

• The resulting association rules are shown below, each listed with its confidence.


Step 5: Generating Association Rules from Frequent Itemsets [Cont.]
– R1: I1 ^ I2 -> I5
• Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%
• R1 is rejected.
– R2: I1 ^ I5 -> I2
• Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%
• R2 is selected.
– R3: I2 ^ I5 -> I1
• Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%
• R3 is selected.
– R4: I1 -> I2 ^ I5
• Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%
• R4 is rejected.
– R5: I2 -> I1 ^ I5
• Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%
• R5 is rejected.
– R6: I5 -> I1 ^ I2
• Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%
• R6 is selected.
In this way, we have found three strong association rules. The same three rules are also the only ones that meet the 70% minimum confidence stated at the start of the example.

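The rule-generation procedure for l = {I1, I2, I5} can be replayed in base R. This is a sketch for illustration; `support_count`, `antecedents` and the rule labels are names invented here.

```r
D <- list(
  c("I1", "I2", "I5"), c("I2", "I4"), c("I2", "I3"),
  c("I1", "I2", "I4"), c("I1", "I3"), c("I2", "I3"),
  c("I1", "I3"), c("I1", "I2", "I3", "I5"), c("I1", "I2", "I3")
)

support_count <- function(itemset, D) {
  sum(vapply(D, function(t) all(itemset %in% t), logical(1)))
}

l <- c("I1", "I2", "I5")
min_conf <- 1  # the 100% threshold used on the slide

# All nonempty proper subsets of l, used as rule antecedents s.
antecedents <- unlist(
  lapply(seq_len(length(l) - 1), function(m) combn(l, m, simplify = FALSE)),
  recursive = FALSE
)

# confidence(s -> l - s) = support_count(l) / support_count(s)
conf <- vapply(
  antecedents,
  function(s) support_count(l, D) / support_count(s, D),
  numeric(1)
)
names(conf) <- vapply(
  antecedents,
  function(s) paste(paste(s, collapse = " ^ "), "->",
                    paste(setdiff(l, s), collapse = " ^ ")),
  character(1)
)

strong <- conf[conf >= min_conf]
print(round(conf, 2))
print(names(strong))
```

The three rules that survive are the ones labelled R2, R3 and R6 on the slide.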
An Example:
Q1. A database has four transactions. Let min_sup = 70% (i.e. a support count of 3) and min_conf = 100%.

TID    Date       items_bought
T100   10/15/99   {K, A, D, B}
T200   10/15/99   {D, A, C, E, B}
T300   10/19/99   {C, A, B, E}
T400   10/22/99   {B, A, D}

C1               L1               L2               L3
Itemset  Count   Itemset  Count   Itemset  Count   Itemset  Count
A        4       A        4       AB       4       ABD      3
B        4       B        4       AD       3
C        2       D        3       BD       3
D        3
E        2
K        1

Therefore, the set of all frequent itemsets is {A}, {B}, {D}, {A B}, {A D}, {B D}, {A B D}.


An Example:
min_sup = 50% (a support count of 2), min_conf = 80%: generate the strong association rules.

TID    Items bought
T100   Sugar (A), Egg (C), Butter (D)
T200   Milk (B), Egg (C), Bread (E)
T300   Sugar (A), Milk (B), Egg (C), Bread (E)
T400   Milk (B), Bread (E)

Strong association rules:
Sugar -> Egg (confidence 2/2 = 100%)
Milk -> Bread (confidence 3/3 = 100%)
Bread -> Milk (confidence 3/3 = 100%)
Milk, Egg -> Bread (confidence 2/2 = 100%)
Egg, Bread -> Milk (confidence 2/2 = 100%)

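The answer to this exercise can be cross-checked by computing support and confidence directly for each listed rule. The sketch below is base R written for this purpose; the helper names and the extra rule Egg -> Sugar (added as a counterexample) are not part of the slides.

```r
D <- list(
  T100 = c("Sugar", "Egg", "Butter"),
  T200 = c("Milk", "Egg", "Bread"),
  T300 = c("Sugar", "Milk", "Egg", "Bread"),
  T400 = c("Milk", "Bread")
)

support_count <- function(itemset, D) {
  sum(vapply(D, function(t) all(itemset %in% t), logical(1)))
}

# Support and confidence of the rule lhs -> rhs.
rule_stats <- function(lhs, rhs) {
  both <- support_count(c(lhs, rhs), D)
  c(support = both / length(D), confidence = both / support_count(lhs, D))
}

rules <- list(
  list(lhs = "Sugar", rhs = "Egg"),
  list(lhs = "Milk", rhs = "Bread"),
  list(lhs = "Bread", rhs = "Milk"),
  list(lhs = c("Milk", "Egg"), rhs = "Bread"),
  list(lhs = c("Egg", "Bread"), rhs = "Milk"),
  list(lhs = "Egg", rhs = "Sugar")  # counterexample, confidence 2/3
)

stats <- t(sapply(rules, function(r) rule_stats(r$lhs, r$rhs)))
rownames(stats) <- vapply(
  rules,
  function(r) paste(paste(r$lhs, collapse = ","), "->", r$rhs),
  character(1)
)

# A rule is strong when it meets both thresholds of the exercise.
strong <- stats[, "support"] >= 0.5 & stats[, "confidence"] >= 0.8
print(cbind(stats, strong))
```

The five rules from the slide pass both thresholds, while Egg -> Sugar reaches the support threshold but fails on confidence.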
Association Mining

The Titanic Dataset

• The Titanic dataset is a 4-dimensional table with summarized information on the fate of passengers on the Titanic according to social class, sex, age and survival.
• To use it for association rule mining, save "titanic.csv" in the working directory (here, the Documents folder).
• Load the dataset:
library(arules)  # provides apriori(), inspect() and read.transactions()
# stringsAsFactors = TRUE: apriori() expects factor (not character) columns,
# and read.csv() no longer creates factors by default since R 4.0.0
titanic <- read.csv("titanic.csv", header = TRUE, stringsAsFactors = TRUE)

View the loaded data
titanic

Creating Association Rules
rules.all <- apriori(titanic)

By default, the minimum support threshold is 0.1 and the minimum confidence is 0.8.

Inspecting the rules created
inspect(rules.all)  # shows all the rules created with the given criteria

The default value of minlen in APparameter is 1. This means that rules with only one item (i.e., an empty antecedent/LHS) like {} => {Age=Adult} will be created. These rules mean that, no matter what other items are involved, the item in the RHS will appear with the probability given by the rule's confidence (which equals the support). If you want to avoid these rules, use the argument parameter=list(minlen=2).

Inspecting the rules created (sorted by lift)
rules.all <- sort(rules.all, by="lift")  # sort by lift value
inspect(rules.all)


Question 1: Show rules containing "Survived" only in the RHS
rules <- apriori(titanic,
                 parameter = list(minlen=2, supp=0.005, conf=0.8),
                 appearance = list(rhs=c("Survived=No", "Survived=Yes"), default="lhs"))
inspect(rules)


Question 2: Keep the confidence, support, and lift values to three decimal places
quality(rules) <- round(quality(rules), digits=3)
inspect(rules)


Question 3: Order the rules by lift
rules.sorted <- sort(rules, by="lift")
inspect(rules.sorted)


Question 4: Show the rules for the survival of adults and children in 1st, 2nd, and 3rd class
# Both the LHS and the RHS are specified, hence default="none"
rules <- apriori(titanic,
                 parameter = list(minlen=3, supp=0.002, conf=0.2),
                 appearance = list(default="none", rhs=c("Survived=Yes"),
                                   lhs=c("Class=1st", "Class=2nd", "Class=3rd",
                                         "Age=Child", "Age=Adult")))
rules.sorted <- sort(rules, by="confidence")
inspect(rules.sorted)


Plotting rules
library(arulesViz)  # the plot() methods for rules come from the arulesViz package
plot(rules.sorted)
# This visualization method draws a two-dimensional scatterplot with different
# measures of interestingness (parameter "measure") on the axes; a third measure
# (parameter "shading") is represented by the color of the points.


Grouped matrix for rules
plot(rules.all, method = "grouped")
# Antecedents (columns) in the matrix are grouped using clustering. Balloons in
# the matrix show which consequents the antecedents are connected to.
plot(rules.sorted, method = "grouped")


Matrix-based visualization
plot(rules.sorted, method="matrix", measure=c("lift", "confidence"))
# Arranges the association rules as a matrix with the itemsets in the
# antecedents on one axis and the itemsets in the consequents on the other.


Graph-based visualization
plot(rules.sorted, method="graph", control=list(type="items"))
# Represents the rules (or itemsets) as a graph
plot(rules.all, method="graph", control=list(type="items"))


Case (items.csv)
Question: Find the association rules with support = 0.22 and confidence = 0.7.
Sol: Save the "items.csv" file in the working directory and load the data:
item <- read.transactions("items.csv", format = "basket", sep = ",")
summary(item)
Note: read.transactions() requires the "arules" package to be installed and loaded.

Reference: http://www.learnbymarketing.com/1043/working-with-arules-transactions-and-read-transactions/


Mining rules (items.csv)
rules.all <- apriori(item, parameter = list(minlen=2, supp=0.22, conf=0.7))
inspect(rules.all)
plot(rules.all, method="graph", control=list(type="items"))


Case (transactions.csv)
Question: Find the association rules with support = 0.7 and confidence = 100%.
Sol: Save the "transactions.csv" file in the working directory and load the data:
item <- read.transactions("transactions.csv", format = "basket", sep = ",")
summary(item)
Note: read.transactions() requires the "arules" package to be installed and loaded.

Mining rules (transactions.csv)
rules.all <- apriori(item, parameter = list(minlen=2, supp=0.7, conf=1))
inspect(rules.all)


Case (supermarket.csv)
Question: Find the association rules with support = 0.4 and confidence = 0.95.
Sol: Save the file in the working directory (here, the Documents folder).
Load the data:
supermarket <- read.transactions("supermarket.csv", format = "basket", sep = ",")
summary(supermarket)
Note: format = "basket" is used when each line of the file is one transaction listing several items separated by sep.
Mining the rules:
# thresholds set to match the question (support = 0.4, confidence = 0.95)
rules.all <- apriori(supermarket, parameter = list(minlen=2, supp=0.4, conf=0.95))
inspect(rules.all)


Case (groceries.csv)
• Display the first three transactions
• Print a frequency plot of the top 10 items
• Display the items with a support of at least 0.15
• Generate the top 5 rules for the given support (0.5) and confidence (0.9)
• Sort the rules by lift value and show the top four


groceries1 <- read.transactions("groceries.csv", format = "basket", sep = ",")

summary(groceries1)
transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146

most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt          (Other)
            2513             1903             1809             1715             1372            34055

inspect(groceries1)


Display the first three transactions

inspect(groceries1[1:3])

    items
[1] {citrus fruit,margarine,ready soups,semi-finished bread}
[2] {coffee,tropical fruit,yogurt}
[3] {whole milk}


Print a frequency plot of the top 10 items

itemFrequencyPlot(groceries1, topN=10)


Display the items with a support of at least 0.15

itemFrequencyPlot(groceries1, support=.15)


Generate the top 5 rules for the given support (0.5) and confidence (0.9)
m1 <- apriori(groceries1, parameter = list(support=.5, confidence=.9))
inspect(m1)
summary(m1)
set of 0 rules
No rule meets these thresholds, so lower them to support = 0.005 and confidence = 0.25:
m1 <- apriori(groceries1, parameter = list(support=.005, confidence=.25))
inspect(m1)
summary(m1)
set of 663 rules
inspect(m1[1:5])
    lhs           rhs                support     confidence lift
[1] {}         => {whole milk}       0.255516014 0.2555160  1.000000
[2] {cake bar} => {whole milk}       0.005592272 0.4230769  1.655775
[3] {dishes}   => {other vegetables} 0.005998983 0.3410405  1.762550
[4] {dishes}   => {whole milk}       0.005287239 0.3005780  1.176357
[5] {mustard}  => {whole milk}       0.005185562 0.4322034  1.691492

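The lift column is not defined on the slides. Under the usual definition, lift(A -> B) = confidence(A -> B) / support(B), so a lift of 1 means the antecedent does not change how often the consequent appears. The short base R check below applies that definition to numbers copied from the output above; the variable names are chosen for this illustration.

```r
# Values copied from rows [1] and [2] of the inspect(m1[1:5]) output.
supp_whole_milk <- 0.255516014  # support of {whole milk} (row [1], empty LHS)
conf_cake_bar   <- 0.4230769    # confidence of {cake bar} => {whole milk}

# lift(A -> B) = confidence(A -> B) / support(B)
lift_cake_bar <- conf_cake_bar / supp_whole_milk
print(lift_cake_bar)

# For the empty-LHS rule, confidence equals support, so the lift is 1.
lift_empty_lhs <- 0.2555160 / supp_whole_milk
print(lift_empty_lhs)
```

Both values agree with the lift column of rows [2] and [1] up to the rounding of the printed figures.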


Sort the rules by lift value and show the top four

inspect(sort(m1, by="lift")[1:4])
    lhs                                           rhs                  support     confidence lift
[1] {citrus fruit,other vegetables,whole milk} => {root vegetables}    0.005795628 0.4453125  4.085493
[2] {butter,other vegetables}                  => {whipped/sour cream} 0.005795628 0.2893401  4.036397
[3] {herbs}                                    => {root vegetables}    0.007015760 0.4312500  3.956477
[4] {citrus fruit,pip fruit}                   => {tropical fruit}     0.005592272 0.4044118  3.854060


Thank you !!!