
ST 402 - Statistical Data Mining

Weekly Assignment 3
S/15/809

1. Explain what is association rule mining?

Association rules are "if-then" statements that help to show the probability of relationships
between data items within large data sets.
Association rule mining is the data mining process of finding the rules that govern
associations and causal structures between sets of items.
Association rule mining has a number of applications, such as market basket analysis,
cross-selling and catalogue design.

2. Define support and confidence of association rules

Support:
Support indicates how frequently the if/then relationship appears in the database; it is the
probability that a transaction contains both X and Y.
𝑆𝑢𝑝𝑝𝑜𝑟𝑡(𝑋 → 𝑌) = 𝑃(𝑋 ∪ 𝑌)
In other words, support is the ratio of the number of transactions containing both the
antecedent and the consequent to the total number of transactions:
Support(A → B) = (number of transactions containing both A and B) / (total number of transactions)

Confidence:
Confidence tells how often the relationship has been found to be true; it is the conditional
probability that a transaction containing X also contains Y.
𝐶𝑜𝑛𝑓𝑖𝑑𝑒𝑛𝑐𝑒(𝑋 → 𝑌) = 𝑃(𝑌 | 𝑋)
Confidence(A → B) = P(B | A) = Support(A ∪ B) / Support(A)

Both the confidence and the support of a rule should be large.
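
A short worked example (using the five-transaction database of question 10 below): four of the
five transactions contain both K and E, and all five contain K, so
Support(K → E) = 4/5 = 80%
Confidence(K → E) = Support(K ∪ E) / Support(K) = 4/5 = 80%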

3. Explain what is antecedent and consequent of an association rule


An association rule has two parts
i) Antecedent (if): An antecedent is an item found within the data.
ii) Consequent (then): A consequent is an item found in combination with the antecedent.
4. Clearly explain what is apriori principle and hence explain what is reverse apriori principle.
• The Apriori principle: all non-empty subsets of a frequent itemset must also be frequent.
The key concept of the Apriori algorithm is the anti-monotonicity of the support measure.
Apriori assumes that
"All subsets of a frequent itemset must be frequent (Apriori property).
If an itemset is infrequent, all its supersets will be infrequent."
• The reverse Apriori principle works in the opposite direction of the same property: any
itemset that contains an infrequent itemset is itself infrequent.

5. What is the main benefit of using reverse apriori principle on association rule mining?
• RAA (the Reverse Apriori Algorithm) works well when the frequent itemsets are found early on.
• The algorithm tends to perform well for larger datasets and poorly for smaller ones.
• It generates the candidate sets first and then the frequent itemsets.

6. Clearly explain why the support of a set monotonically decreases if items are added to the set.

Every transaction that contains a superset X ∪ {i} must also contain X, but not vice versa.
Hence the set of transactions supporting X ∪ {i} is a subset of the transactions supporting X,
so the support can only stay the same or decrease as items are added to the set. This
anti-monotonicity of support is exactly the property that the Apriori principle exploits.

7. What are the main two steps of association rule mining?


1. Frequent itemset generation: find all itemsets whose support is greater than or equal to
min_sup.
2. Rule generation: generate strong association rules from the frequent itemsets, i.e. rules
whose confidence is greater than or equal to min_conf.

8. Clearly write the algorithm for frequent item set generation.


Let Ck denote the set of candidate k-itemsets and Lk denote the set of frequent k-itemsets.

L1 = {frequent 1-itemsets};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = candidates generated by joining Lk with itself;
    for each transaction t in the database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 whose count satisfies min_support;
end
return ⋃k Lk
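
A minimal from-scratch R sketch of this level-wise loop (illustrative only; for simplicity it
enumerates all k-subsets of the items that still appear in frequent itemsets instead of
performing the proper self-join of question 9, which is less efficient but yields the same
frequent itemsets):

frequent_itemsets <- function(transactions, min_sup_count) {
  # transactions: a list of character vectors; min_sup_count: absolute support count
  sup <- function(itemset)
    sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))

  result <- list()
  items  <- sort(unique(unlist(transactions)))            # items for the candidate 1-itemsets
  k <- 1
  repeat {
    Ck <- combn(items, k, simplify = FALSE)               # candidate k-itemsets
    Lk <- Filter(function(s) sup(s) >= min_sup_count, Ck) # frequent k-itemsets
    if (length(Lk) == 0) break
    result <- c(result, Lk)
    items  <- sort(unique(unlist(Lk)))                    # only these items can still appear
    k <- k + 1
    if (length(items) < k) break
  }
  result
}

# hypothetical three-basket example: returns {A}, {B}, {C}, {A, B}, {B, C}
frequent_itemsets(list(c("A", "B", "C"), c("A", "B"), c("B", "C")), min_sup_count = 2)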

9. Clearly explain self-join and prune steps of candidate item-set generation.

Join: the set of frequent k-itemsets is joined with itself to generate candidate
(k+1)-itemsets; two k-itemsets are merged when they agree on their first k−1 items.

Prune: the candidate set resulting from the join is filtered using the Apriori property: any
candidate that has a subset which is not frequent (i.e. below the minimum support threshold)
is removed before its support is ever counted.

The analogous join/prune steps are also used when merging rules during rule generation, as in
these examples:
Ex (join): 𝐽𝑜𝑖𝑛(𝐶𝐷 → 𝐴𝐵, 𝐵𝐷 → 𝐴𝐶) produces the candidate rule 𝐷 → 𝐴𝐵𝐶.
Ex (prune): Prune the rule 𝐷 → 𝐴𝐵𝐶 if its subset rule 𝐴𝐷 → 𝐵𝐶 does not have high confidence.
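
A small illustrative R sketch of the two steps for candidate itemset generation (assumed
representation: each itemset is a sorted character vector; this code is not part of the
original answer):

self_join <- function(Lk) {
  # merge two frequent k-itemsets that share their first k-1 items
  k <- length(Lk[[1]])
  out <- list()
  for (i in seq_along(Lk)) for (j in seq_along(Lk)) {
    if (identical(Lk[[i]][-k], Lk[[j]][-k]) && Lk[[i]][k] < Lk[[j]][k]) {
      out[[length(out) + 1]] <- c(Lk[[i]], Lk[[j]][k])   # sorted (k+1)-candidate
    }
  }
  out
}

prune <- function(Ck1, Lk) {
  # drop any candidate with a k-subset that is not frequent (Apriori property)
  is_frequent <- function(s) any(vapply(Lk, identical, logical(1), y = s))
  keep <- vapply(Ck1, function(cand)
    all(vapply(combn(cand, length(cand) - 1, simplify = FALSE),
               is_frequent, logical(1))),
    logical(1))
  Ck1[keep]
}

L2 <- list(c("E","K"), c("E","O"), c("K","M"), c("K","O"), c("K","Y"))  # from question 10
C3 <- self_join(L2)   # {E,K,O}, {K,M,O}, {K,M,Y}, {K,O,Y}
prune(C3, L2)         # only {E,K,O} survives: {M,O}, {M,Y} and {O,Y} are not frequent

Note that with this prune step, {K,M,O} and {K,M,Y} would be discarded without ever counting
their support, whereas the hand computation in question 10(a) counts them in C3 first.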
10. A database has five transactions. Let min_sup = 60% and min_conf = 80%.

TID    Items Bought
T001   {M, O, N, K, E, Y}
T002   {D, O, N, K, E, Y}
T003   {M, A, K, E}
T004   {M, U, C, K, Y}
T005   {C, O, O, K, I, E}

Side calculations: 60/100 = x/5 ⇒ x = 3 (minimum support count), and 80/100 = y/5 ⇒ y = 4.

a. Find all frequent item-sets using Apriori algorithm

Iteration 01:
C1:
Itemset Sup
A 1
C 2
D 1
E 4
I 1
K 5
M 3
N 2
O 3
U 1
Y 3

L1:
Itemset Sup
E 4
K 5
M 3
O 3
Y 3
Iteration 02:
C2:
Itemset Sup
E,K 4
E,M 2
E,O 3
E,Y 2
K,M 3
K,O 3
K,Y 3
M,O 1
M,Y 2
O,Y 2

L2:
Itemset Sup
E,K 4
E,O 3
K,M 3
K,O 3
K,Y 3

Iteration 03:
C3:
Itemset Sup
E,K,O 3
K,M,O 1
K,M,Y 2

L3:
Itemset Sup
E,K,O 3

STOP.
b. List all of the strong association rules (with support s and confidence c) matching the
following meta-rule, where X is a variable representing customers and item1, item2, item3
denote variables representing items (e.g., "A", "B"):

∀X ∈ transaction, buys(X, item1) ∧ buys(X, item2) ⇒ buys(X, item3)   [s, c]

So, we have only one frequent itemset of length 3,


So, for the itemset [E, O, K] there are three rules that match the meta-rule (two items in
the antecedent, one in the consequent):
1) [E, O] → K
CONF = support ([E, O, K])/support ([E, O])
=3/3 = 100%
Hence [E, O] → K is an association rule
2) [O, K] → E
CONF = support ([E, O, K])/support ([O, K])
= 3/3 = 100%
Hence [O, K] → E is an association rule
3) [E, K] → O
CONF = support ([E, O, K])/support ([E, K])
= 3/4 = 75%
As 75% < 80%, [E, K] → O is not a strong association rule.

All rules from the frequent itemset {E, K, O} and their confidences:
1. [E, K] → O : 3/4 = 75%
2. [K, O] → E : 3/3 = 100%
3. [E, O] → K : 3/3 = 100%
4. E → [K, O] : 3/4 = 75%
5. K → [E, O] : 3/5 = 60%
6. O → [E, K] : 3/3 = 100%

∴ Rules 1, 4 and 5 are discarded because their confidence is below min_conf = 80%.

So rules 2, 3 and 6 are selected as strong rules (each with support 3/5 = 60% and confidence
100%). Of these, rules 2 and 3 match the given meta-rule:
buys(X, K) ∧ buys(X, O) ⇒ buys(X, E)   [60%, 100%]
buys(X, E) ∧ buys(X, O) ⇒ buys(X, K)   [60%, 100%]

c. Use the R library arules to find strong association rules for the above database of items
and compare them with the manual results obtained in the step above.
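
A minimal sketch for part (c), assuming the five transactions are entered by hand (this is
not the original submission's code). With supp = 0.6 and conf = 0.8, the rules derived from
{E, K, O} should match the manual results above; arules will additionally report strong rules
built from frequent 2-itemsets (e.g. {O} → {E}, {M} → {K}), which part (b) did not consider
because of the meta-rule.

library(arules)

baskets <- list(
  T001 = c("M", "O", "N", "K", "E", "Y"),
  T002 = c("D", "O", "N", "K", "E", "Y"),
  T003 = c("M", "A", "K", "E"),
  T004 = c("M", "U", "C", "K", "Y"),
  T005 = c("C", "O", "K", "I", "E")   # the duplicated O counts only once per transaction
)
trans <- as(baskets, "transactions")

rules <- apriori(trans, parameter = list(supp = 0.6, conf = 0.8, minlen = 2))
inspect(sort(rules, by = "confidence"))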

11. Explain how to improve efficiency of the apriori algorithm using Hash based techniques.
The Apriori algorithm can generate a very large number of candidate itemsets, so a hashing
technique is used to reduce the size of the candidate k-itemsets Ck, in particular the
candidate 2-itemsets, since that is the key to improving performance. A hash-based Apriori
implementation uses a hash table data structure and applies hashing during the first pass to
minimise the number of candidate itemsets generated for the next pass. The number of itemsets
in C2 produced using hashing can be made much smaller, so the work required to determine L2
becomes more efficient.
For instance, while scanning each transaction in the database to generate the frequent
1-itemsets L1 from the candidate 1-itemsets in C1, we can also generate all of the 2-itemsets
in that transaction, hash (i.e. map) each of them into a bucket of a hash table structure, and
increment the corresponding bucket count. A 2-itemset whose bucket count in the hash table is
below the min_sup threshold cannot be frequent and can therefore be removed from the candidate
set. In this way, hash-based Apriori reduces the number of candidate k-itemsets.

Steps (a small R sketch of this bucket counting is given after the steps):
1. Scan the database transactions and generate the frequent 1-itemsets; at the same time
generate the 2-itemsets of each transaction.
2. Take a hash table of a fixed size.
3. Assign each candidate 2-itemset to a bucket, e.g. by hashing the ASCII values of its items.
4. Each bucket in the hash table has a count, which is increased by 1 every time an itemset is
hashed to that bucket.
5. If the bucket count satisfies the min_sup threshold value, the corresponding bit in a bit
vector is set to 1; otherwise it is set to 0.
6. The candidate pairs whose bit-vector bit is not set are removed.
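
A minimal R sketch of the bucket counting described above (the hash function, the table size
and the use of the question-10 transactions are illustrative assumptions, not part of the
original answer):

transactions <- list(
  c("M","O","N","K","E","Y"), c("D","O","N","K","E","Y"),
  c("M","A","K","E"), c("M","U","C","K","Y"), c("C","O","O","K","I","E")
)
min_sup_count <- 3
n_buckets     <- 7                          # assumed small hash table size

# toy hash: sum of the ASCII codes of the two items, modulo the table size
pair_bucket <- function(pair) (sum(utf8ToInt(paste(pair, collapse = ""))) %% n_buckets) + 1

bucket_count <- integer(n_buckets)
for (t in transactions) {
  items <- sort(unique(t))
  for (p in combn(items, 2, simplify = FALSE)) {
    b <- pair_bucket(p)
    bucket_count[b] <- bucket_count[b] + 1  # step 4: increment the bucket count
  }
}

bit_vector <- bucket_count >= min_sup_count # step 5: mark buckets that reach min_sup
# step 6: a candidate pair is kept only if its bucket's bit is set
keep_pair <- function(pair) bit_vector[pair_bucket(pair)]
keep_pair(c("M", "O"))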

12. Clearly explain Direct Hashing and Pruning (DHP) Algorithm.

• DHP is based on the Apriori algorithm.
• It uses a hashing technique to filter out unnecessary candidate itemsets Ck+1.
• The DHP algorithm also progressively trims (prunes) the transactions in the database, so
later passes have less data to scan.

13. Discuss advantages of DHP algorithm over apriori algorithm.

• Experiments have shown that as the size of the database grows, the DHP algorithm
significantly outperforms the Apriori algorithm, mainly because it generates far fewer
candidate 2-itemsets and shrinks the database as it runs. However, the performance of DHP
depends heavily on the hash table.
• A drawback of the DHP algorithm is that the hash table competes for memory space with the
hash tree used to hold the counts for the itemsets.
14. Manually workout the example given in question 10 using DHP algorithm. Clearly state all the
steps.

15. Explain FP-Growth (Frequent Pattern Growth) algorithm for association rule generation.
1. The first step is to scan the database to find the occurrences of the itemsets in the
database. This step is the same as the first step of Apriori. The count of 1-itemsets in the
database is called support count or frequency of 1-itemset.
2. The second step is to construct the FP tree. For this, create the root of the tree. The root
is represented by null.
3. The next step is to scan the database again and examine the transactions. Examine the
first transaction and find out the itemset in it. The itemset with the max count is taken
at the top, the next itemset with lower count and so on. It means that the branch of the
tree is constructed with transaction itemsets in descending order of count.
4. The next transaction in the database is examined, with its items again ordered in
descending order of count. If the leading items of this transaction are already present in
another branch (for example from the 1st transaction), then this transaction shares a common
prefix with that branch, starting from the root, and its remaining items are linked as new
nodes below the shared prefix (a small sketch of this ordering is given after these steps).
5. Also, the count of the itemset is incremented as it occurs in the transactions. Both the
common node and new node count is increased by 1 as they are created and linked
according to transactions.
6. The next step is to mine the created FP Tree. For this, the lowest node is examined first
along with the links of the lowest nodes. The lowest node represents the frequency
pattern length 1. From this, traverse the path in the FP Tree. This path or paths are
called a conditional pattern base.
7. Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The
itemsets meeting the threshold support are considered in the Conditional FP Tree.
8. Frequent Patterns are generated from the Conditional FP Tree.
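
A small R sketch of the ordering used in steps 3–4, using the frequent-item counts of the
question-10 data (infrequent items are dropped and the remaining items are sorted by
descending support before insertion into the tree); this is an illustration, not part of the
original answer:

item_support <- c(K = 5, E = 4, M = 3, O = 3, Y = 3)    # frequent items and their counts
order_transaction <- function(t) {
  t <- t[t %in% names(item_support)]                    # drop infrequent items (e.g. N)
  t[order(item_support[t], decreasing = TRUE)]          # descending support order
}
order_transaction(c("M", "O", "N", "K", "E", "Y"))      # "K" "E" "M" "O" "Y"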

16. Discuss how the efficiency of apriori algorithm is improved using FP-Growth algorithm.
1. This algorithm needs to scan the database only twice when compared to Apriori which scans the
transactions for each iteration.
2. The pairing of items is not done in this algorithm and this makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.
17. Manually workout the example given in question 10 using FP-Growth algorithm. Clearly state all
the steps.

Table 01:
TID Items Bought
T001 {M,O,N,K,E,Y}
T002 {D,O,N,K,E,Y}
T003 {M,A,K,E}
T004 {M,U,C,K,Y}
T005 {C,O,O,K,I,E}

Table 02:
Itemset Sup
A 1
C 2
D 1
E 4
I 1
K 5
M 3
N 2
O 3
U 1
Y 3

Table 03:
Itemset Sup
K 5
E 4
M 3
O 3
Y 3
The FP-tree (each node shows item:count; transactions are inserted in the order K, E, M, O, Y):

null
 └─ K:5
     ├─ E:4
     │   ├─ M:2
     │   │   └─ O:1
     │   │       └─ Y:1
     │   └─ O:2
     │       └─ Y:1
     └─ M:1
         └─ Y:1

Mining this tree (building the conditional pattern bases of Y, O, M and E) yields the same
frequent itemsets as the Apriori computation in question 10(a), with {E, K, O} as the largest
frequent itemset.

18. Use the “Online Retail.xls” data set to extract the first 10 valid rules satisfying the minimum
support = 0.001 and minimum confidence = 0.80 using R package “arules”.
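
No worked answer is given in the document; a minimal sketch of one possible approach follows.
It assumes the spreadsheet has InvoiceNo and Description columns (the usual layout of the
Online Retail data set); the column names, the use of readxl, and ordering the "first 10"
rules by confidence are all assumptions.

library(readxl)
library(arules)

retail  <- read_excel("Online Retail.xls")
retail  <- retail[!is.na(retail$Description), ]   # drop rows without an item description
# one transaction per invoice: the set of distinct item descriptions bought together
baskets <- lapply(split(retail$Description, retail$InvoiceNo), unique)
trans   <- as(baskets, "transactions")

rules <- apriori(trans, parameter = list(supp = 0.001, conf = 0.80, minlen = 2))
inspect(head(sort(rules, by = "confidence"), 10))  # first 10 valid rules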
