Assignment 03
Weekly Assignment 3
S/15/809
Association rules are "if-then" statements that help to show the probability of relationships between data items within large data sets. Association rule mining is the data mining process of finding the rules that govern associations and causal relationships between sets of items. It has a number of applications.
Support:
Support indicates how frequently the if/then relationship appears in the database.
Probability that a transaction contains X and Y.
Support = P(X ∪ Y)
Confidence:
Confidence tells how often these if/then relationships have been found to be true. It is the conditional probability that a transaction containing X also contains Y:
Confidence = P(Y | X)
Confidence can also be defined as the percentage of transactions containing A that also contain B:
Confidence(A → B) = P(B | A)
Confidence(A → B) = Support(A ∪ B) / Support(A)
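As a quick illustration of the two measures, here is a minimal R sketch (my own example with made-up item names, not from the assignment data) that computes support and confidence by hand:

# Toy transaction list; each element is one transaction.
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk")
)
n <- length(transactions)
contains <- function(t, items) all(items %in% t)

# Support(bread -> milk) = P(X ∪ Y): fraction of transactions with both items.
sup_xy <- sum(sapply(transactions, contains, items = c("bread", "milk"))) / n
# Support(bread): fraction of transactions containing bread.
sup_x <- sum(sapply(transactions, contains, items = "bread")) / n

sup_xy            # 0.5
sup_xy / sup_x    # confidence P(milk | bread) = Support(X ∪ Y) / Support(X) ≈ 0.67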
5. What is the main benefit of using the reverse Apriori principle in association rule mining?
Reverse Apriori (RAA) works well when the frequent itemsets are found at the beginning. The algorithm can perform well on larger datasets and poorly on smaller ones. It generates the candidate sets first and then the frequent itemsets.
6. Clearly explain why the support of a set monotonically decreases as items are added to the set.
Every transaction that contains an itemset X also contains every subset of X, so adding an item to X can only shrink, never grow, the set of supporting transactions; hence Support(X ∪ {i}) ≤ Support(X). This anti-monotone (Apriori) property is what the level-wise algorithm below exploits to prune candidates:
L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
Prune: the resulting set from the join step is filtered against the minimum support threshold (based on the Apriori property).
Ex: prune rule D → ABC if its subset AD → BC does not have high confidence.
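To make the join and prune steps concrete, here is a rough R sketch of candidate generation (my own illustrative code; each itemset is a sorted character vector). Note that with the subset-based prune applied, only {E, K, O} survives from the L2 of question 10; the worked answer below also lists {K, M, O} and {K, M, Y} and discards them by support counting instead.

# Generate Ck+1 from Lk: join itemsets sharing their first k-1 items,
# then prune candidates that have an infrequent k-subset.
apriori_gen <- function(Lk) {
  cands <- list()
  for (i in seq_along(Lk)) {
    for (j in seq_along(Lk)) {
      if (j <= i) next
      a <- Lk[[i]]; b <- Lk[[j]]
      k <- length(a)
      if (k == 1 || identical(a[-k], b[-k])) {     # join step
        cand <- sort(union(a, b))
        # Prune step (Apriori property): every k-subset must be in Lk.
        ok <- all(sapply(seq_along(cand), function(d)
          any(sapply(Lk, identical, cand[-d]))))
        if (ok) cands[[length(cands) + 1]] <- cand
      }
    }
  }
  unique(cands)
}

L2 <- list(c("E","K"), c("E","O"), c("K","M"), c("K","O"), c("K","Y"))
apriori_gen(L2)   # returns only {"E","K","O"} after pruning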
10. A database has five transactions. Let min_sup = 60% and min_conf = 80%.

TID    Items Bought
T001   {M,O,N,K,E,Y}
T002   {D,O,N,K,E,Y}
T003   {M,A,K,E}
T004   {M,U,C,K,Y}
T005   {C,O,O,K,I,E}

Minimum support count: x = (60/100) × 5 = 3
Minimum confidence count: y = (80/100) × 5 = 4
Iteration 01:
C1:
Itemset Sup
A 1
C 2
D 1
E 4
I 1
K 5
M 3
N 2
O 3
U 1
Y 3
L1:
Itemset Sup
E 4
K 5
M 3
O 3
Y 3
Iteration 02:
C2:
Itemset Sup
E,K 4
E,M 2
E,O 3
E,Y 2
K,M 3
K,O 3
K,Y 3
M,O 1
M,Y 2
O,Y 2
L2:
Itemset Sup
E,K 4
E,O 3
K,M 3
K,O 3
K,Y 3
Iteration 03:
C3:
Itemset Sup
E,K,O 3
K,M,O 1
K,M,Y 2
L3:
Itemset Sup
E,K,O 3
STOP.
b. List all of the strong association rules (with support s and confidence c) matching the following meta-rule, where X is a variable representing customers and item denotes variables representing items (e.g., "A", "B", etc.):
∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3) [s, c]
Association Rules:
1. [E, K] → O : confidence = 3/4 = 75%
2. [K, O] → E : confidence = 3/3 = 100%
3. [E, O] → K : confidence = 3/3 = 100%
4. E → [K, O] : confidence = 3/4 = 75%
5. K → [E, O] : confidence = 3/5 = 60%
6. O → [E, K] : confidence = 3/3 = 100%
With min_conf = 80%, rules 2, 3, and 6 are strong; of these, the rules matching the meta-rule's two-item antecedent are [K, O] → E and [E, O] → K, each with support s = 60% and confidence c = 100%.
c. Use the R library arules to find strong association rules for the above database of items and compare them with the manual results obtained in the step above.
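A possible arules session for part c (a sketch; the five transactions are typed in directly from the table in question 10):

library(arules)

trans_list <- list(
  T001 = c("M","O","N","K","E","Y"),
  T002 = c("D","O","N","K","E","Y"),
  T003 = c("M","A","K","E"),
  T004 = c("M","U","C","K","Y"),
  T005 = c("C","O","K","I","E")    # T005's duplicate O listed once: support counts transactions
)
trans <- as(trans_list, "transactions")

# min_sup = 60% and min_conf = 80%, as given in question 10.
rules <- apriori(trans, parameter = list(supp = 0.6, conf = 0.8, minlen = 2))
inspect(sort(rules, by = "confidence"))
# The output should agree with the manual result, e.g. {K,O} => {E} and
# {E,O} => {K} at confidence 1.0.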
11. Explain how to improve the efficiency of the Apriori algorithm using hash-based techniques.
As we know, the Apriori algorithm has some weaknesses, so a hashing technique is used to reduce the size of the candidate k-itemsets, Ck. A hash-based Apriori implementation uses a data structure that directly represents a hash table, in particular for the 2-itemsets, since that is the key to improving performance. This technique minimizes the number of candidate itemsets generated in the first pass: the number of itemsets in C2 produced using hashing is smaller, so the work required to determine L2 is reduced.
For instance, while scanning each transaction in the database to generate the frequent 1-itemsets, L1, from the candidate 1-itemsets in C1, we can generate all of the 2-itemsets for each transaction, hash (i.e., map) them into the different buckets of a hash table structure, and increment the corresponding bucket counts. A 2-itemset whose corresponding bucket count in the hash table is below the min_sup threshold cannot be frequent and thus should be removed from the candidate set. In this way, hash-based Apriori reduces the number of candidate k-itemsets.
Steps:
1. Scan the database transactions and generate the frequent 1-itemsets; after that, generate the candidate 2-itemsets.
2. Take a hash table of a chosen size.
3. For each bucket, assign candidate itemsets using a hash function over the ASCII values of the itemsets.
4. Each bucket in the hash table has a count, which is incremented by 1 every time an itemset is hashed to that bucket.
5. If the bucket count satisfies the min_sup threshold value, the corresponding bit in a bit vector is set to 1; otherwise it is set to 0.
6. The candidate pairs whose bit-vector bit is not set are removed.
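The following toy R sketch mimics the bucket-counting and bit-vector steps for candidate 2-itemsets (my own illustration; the hash function and the table size of 7 are arbitrary choices, not prescribed by the technique):

transactions <- list(
  c("M","O","N","K","E","Y"), c("D","O","N","K","E","Y"),
  c("M","A","K","E"), c("M","U","C","K","Y"), c("C","O","O","K","I","E")
)
n_buckets <- 7
bucket_count <- integer(n_buckets)

# Hash a pair of items into a bucket using the ASCII codes of the items.
hash_pair <- function(a, b) (utf8ToInt(a) * 10 + utf8ToInt(b)) %% n_buckets + 1

for (t in transactions) {
  pairs <- combn(sort(unique(t)), 2)         # all 2-itemsets of the transaction
  for (j in seq_len(ncol(pairs))) {
    k <- hash_pair(pairs[1, j], pairs[2, j])
    bucket_count[k] <- bucket_count[k] + 1   # step 4: increment the bucket count
  }
}

min_sup_count <- 3                           # 60% of 5 transactions
bit_vector <- as.integer(bucket_count >= min_sup_count)   # step 5
# Step 6: 2-itemsets hashing to buckets whose bit is 0 cannot be frequent,
# so they are dropped from C2 before the support-counting scan.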
15. Explain FP-Growth (Frequent Pattern Growth) algorithm for association rule generation.
1. The first step is to scan the database to find the occurrences of the itemsets in the
database. This step is the same as the first step of Apriori. The count of 1-itemsets in the
database is called support count or frequency of 1-itemset.
2. The second step is to construct the FP tree. For this, create the root of the tree. The root
is represented by null.
3. The next step is to scan the database again and examine the transactions. Examine the first transaction and find the itemset in it. The itemset with the maximum count is taken at the top, followed by the itemset with the next lower count, and so on. This means the branch of the tree is constructed from the transaction's itemsets in descending order of count.
4. The next transaction in the database is examined. Its itemsets are ordered in descending order of count. If any itemset of this transaction is already present in another branch (for example, from the 1st transaction), then this transaction's branch shares a common prefix starting from the root. This means the common itemset is linked to the new node of another itemset in this transaction.
5. Also, the count of each itemset is incremented as it occurs in the transactions. Both the common node's count and the new node's count are increased by 1 as nodes are created and linked according to the transactions.
6. The next step is to mine the created FP tree. For this, the lowest node is examined first, along with the links of the lowest nodes. The lowest node represents a frequent pattern of length 1. From there, traverse the path in the FP tree. This path or set of paths is called a conditional pattern base.
7. Construct a conditional FP tree, which is formed from the counts of itemsets in the path. The itemsets meeting the threshold support are considered in the conditional FP tree.
8. Frequent patterns are generated from the conditional FP tree.
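As a small illustration of steps 1 and 3, using the data from question 10 with a minimum support count of 3 (my own sketch), each transaction can be reordered by descending item support before it is inserted into the tree:

transactions <- list(
  T001 = c("M","O","N","K","E","Y"), T002 = c("D","O","N","K","E","Y"),
  T003 = c("M","A","K","E"), T004 = c("M","U","C","K","Y"),
  T005 = c("C","O","O","K","I","E")
)
# Step 1: one scan to get the support counts, then keep only frequent items.
sup  <- table(unlist(lapply(transactions, unique)))
freq <- sort(sup[sup >= 3], decreasing = TRUE)   # K:5, E:4, then M, O, Y at 3

# Step 3: drop infrequent items and impose the global descending order.
order_txn <- function(t) intersect(names(freq), t)
lapply(transactions, order_txn)
# e.g. T001 {M,O,N,K,E,Y} becomes K, E, M, O, Y (ties at count 3 are broken
# alphabetically here), which is the insertion order used in question 17.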
16. Discuss how the efficiency of the Apriori algorithm is improved using the FP-Growth algorithm.
1. This algorithm needs to scan the database only twice, whereas Apriori scans the transactions once for each iteration.
2. The pairing of items is not done in this algorithm, and this makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
17. Manually work out the example given in question 10 using the FP-Growth algorithm. Clearly state all the steps.
Table 01:
TID Items Bought
T001 {M,O,N,K,E,Y}
T002 {D,O,N,K,E,Y}
T003 {M,A,K,E}
T004 {M,U,C,K,Y}
T005 {C,O,O,K,I,E}
Table 02:
Itemset Sup
A 1
C 2
D 1
E 4
I 1
K 5
M 3
N 2
O 3
U 1
Y 3
Table 03:
Itemset Sup
K 5
E 4
M 3
O 3
Y 3
FP tree (transactions inserted in the order K, E, M, O, Y):

Null
  K:5
    E:4
      M:2
        O:1
          Y:1
      O:2
        Y:1
    M:1
      Y:1
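For completeness, a rough R sketch (my own illustration, not required by the question) that builds the same tree as nested lists from the ordered transactions:

# The five transactions, already reordered as K, E, M, O, Y.
ordered <- list(
  c("K","E","M","O","Y"), c("K","E","O","Y"), c("K","E","M"),
  c("K","M","Y"), c("K","E","O")
)

new_node <- function() list(count = 0, children = list())
insert_txn <- function(node, items) {
  if (length(items) == 0) return(node)
  h <- items[1]
  if (!(h %in% names(node$children))) node$children[[h]] <- new_node()
  node$children[[h]]$count <- node$children[[h]]$count + 1
  node$children[[h]] <- insert_txn(node$children[[h]], items[-1])
  node
}

root <- new_node()
for (t in ordered) root <- insert_txn(root, t)
str(root, max.level = 5)   # mirrors the tree above: K:5, E:4, M:1, ...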
18. Use the “Online Retail.xls” data set to extract the first 10 valid rules satisfying the minimum
support = 0.001 and minimum confidence = 0.80 using R package “arules”.
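A possible arules workflow for this question (a sketch under assumptions: the .xls file has first been exported to CSV, and the usual UCI Online Retail column names InvoiceNo and Description are assumed; the readxl package could instead read the .xls directly):

library(arules)

retail <- read.csv("Online Retail.csv", stringsAsFactors = FALSE)

# Build transactions: one basket of item descriptions per invoice.
baskets <- lapply(split(retail$Description, retail$InvoiceNo), unique)
trans   <- as(baskets, "transactions")

rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.80))
inspect(head(sort(rules, by = "confidence"), 10))   # first 10 valid rules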