An Approach of Improvisation in Efficiency of Apriori Algorithm
1. INTRODUCTION
The association rule problem has been studied since 1993, and many researchers have worked on optimizing the original algorithm, for example by random sampling, discarding rules, or changing the storage framework [1]. Association rules are mined from large amounts of data to identify relationships between items, which reveal how people buy sets of items: customers tend to follow particular patterns when purchasing sets of items.
In data mining, association rule mining first discovers unknown dependencies in the data and then derives rules between the items [3]. The association rule mining problem is defined as follows:
DBT = {T1, T2, ..., TN} is a database of N transactions, where each transaction Ti is a set of items.
TID | Items
1   | CPU, Monitor
2   | CPU, Keyboard, Mouse, UPS
3   | Monitor, Keyboard, Mouse, Motherboard
4   | CPU, Monitor, Keyboard, Mouse
5   | CPU, Monitor, Keyboard, Motherboard
Rule Evaluation:
Support: the rate of occurrence of an itemset in the transaction database.

Support (Keyboard → Mouse) = (No. of transactions containing both Keyboard and Mouse) / (No. of total transactions)

Confidence: for a rule X → Y, the proportion of transactions containing X that also contain Y.

Confidence (Keyboard → Mouse) = (No. of transactions containing both Keyboard and Mouse) / (No. of transactions containing Keyboard)
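For concreteness, the two measures can be computed directly from the example database above. The following minimal Python sketch is ours (not part of the paper) and simply counts transactions:

transactions = [
    {"CPU", "Monitor"},
    {"CPU", "Keyboard", "Mouse", "UPS"},
    {"Monitor", "Keyboard", "Mouse", "Motherboard"},
    {"CPU", "Monitor", "Keyboard", "Mouse"},
    {"CPU", "Monitor", "Keyboard", "Motherboard"},
]

def support(itemset, db):
    # fraction of transactions that contain every item of `itemset`
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(x, y, db):
    # fraction of transactions containing X that also contain Y
    with_x = [t for t in db if x <= t]
    return sum(1 for t in with_x if y <= t) / len(with_x)

print(support({"Keyboard", "Mouse"}, transactions))       # 3/5 = 0.6
print(confidence({"Keyboard"}, {"Mouse"}, transactions))  # 3/4 = 0.75

Here Keyboard and Mouse appear together in transactions 2, 3 and 4, so the support is 3/5; Keyboard appears in four transactions, so the confidence of Keyboard → Mouse is 3/4.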
Algorithm Apriori_algo(D, minsup)
1. L1 = {frequent 1-itemsets};
2. for (k=2; Lk-1 ≠ Φ; k++) {
3.   Ck = generate_Apriori(Lk-1);    // new candidates
4.   forall transactions t ∈ D do begin
5.     Ct = subset(Ck, t);           // candidates contained in t
6.     forall candidates c ∈ Ct do
7.       c.count++;
8.   end
9.   Lk = {c ∈ Ck | c.count ≥ minsup}
10. }
11. Answer = ∪k Lk

Algorithm 1. Apriori algorithm [6]
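As a hedged illustration of this loop, the Python sketch below follows the same structure; the candidate-generation step is a simplified self-join standing in for the apriori-gen procedure of Algorithm 2, and all names are ours:

from itertools import combinations

def apriori(transactions, min_sup):
    # returns a dict mapping each frequent itemset (frozenset) to its support count
    counts = {}
    for t in transactions:                       # L1: count single items
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {c: n for c, n in counts.items() if n >= min_sup}
    answer = dict(current)

    k = 2
    while current:                               # loop while L(k-1) is non-empty
        prev = list(current)
        # Ck: simplified self-join of L(k-1) (stand-in for apriori-gen)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        counts = {c: 0 for c in candidates}
        for t in transactions:                   # count candidates contained in t
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {c: n for c, n in counts.items() if n >= min_sup}
        answer.update(current)
        k += 1
    return answer

On the 5-transaction table from the introduction with min_sup = 3, this returns the four frequent items CPU, Monitor, Keyboard, Mouse and the frequent pairs {CPU, Monitor}, {CPU, Keyboard}, {Monitor, Keyboard} and {Keyboard, Mouse}.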
3. select p.I1, p.I2, ..., p.Ik-1, q.Ik-1 from p, q where p.I1 = q.I1, ..., p.Ik-2 = q.Ik-2, p.Ik-1 < q.Ik-1;
4. forall itemsets c ∈ Ck do
5.   forall (k-1)-subsets s of c do
6.     if (s ∉ Lk-1) then
7.       delete c from Ck

Algorithm 2. Apriori-gen algorithm [6]
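The join and prune steps can be rendered in Python as a sketch under the assumption that itemsets are kept as sorted tuples, so the condition p.I1 = q.I1, ..., p.Ik-2 = q.Ik-2, p.Ik-1 < q.Ik-1 becomes a prefix test; the names are ours:

from itertools import combinations

def apriori_gen(l_prev):
    # l_prev: the frequent (k-1)-itemsets as sorted tuples of items
    l_prev = sorted(l_prev)
    prev_set = set(l_prev)
    k = len(l_prev[0]) + 1 if l_prev else 0
    candidates = []
    for i, p in enumerate(l_prev):
        for q in l_prev[i + 1:]:
            # join step: equal (k-2)-prefix, last item of q greater than last item of p
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # prune step: every (k-1)-subset of c must itself be frequent
                if all(s in prev_set for s in combinations(c, k - 1)):
                    candidates.append(c)
    return candidates

# Example: from L2 = {(A,B), (A,C), (B,C), (B,D)} the join produces (A,B,C) and (B,C,D),
# and (B,C,D) is pruned because (C,D) is not in L2.
print(apriori_gen([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]))   # [('A', 'B', 'C')]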
A large number of candidate and frequent itemsets have to be handled, which increases the cost and wastes time. For example, if the number of frequent (k-1)-itemsets is 10^4, then almost 10^7 candidates in Ck may need to be generated and tested [2]. Consequently, the database has to be scanned many times to find Ck.
The following section presents an idea to improve Apriori efficiency, along with an example and an algorithm.
In this approach to improving the efficiency of the Apriori algorithm, we focus on reducing the time consumed in generating Ck and counting its support.
In the process of finding frequent itemsets, the size of each transaction (ST) in the database is first computed and maintained. Next, L1 is built; it contains the set of items, the support value of each item, and the ids of the transactions containing each item. L1 is then used to generate L2, L3, ..., while the database size is progressively reduced so that less time is spent scanning transactions.
To generate C2(x, y) (where x and y are the items of a candidate in Ck), join Lk-1 with Lk-1. To find L2 from C2, instead of scanning the complete database and all its transactions, we remove every transaction where ST < k (for k = 2, 3, ...) and also remove the deleted transactions from L1. This helps reduce the time spent scanning such transactions.
For each candidate, determine which of x and y has the minimum support and get the transaction ids of that item from L1. Ck is then counted only over these specific transactions, taken from the reduced database. L2 consists of the candidates in C2 whose support >= min_sup.
C3(x, y, z), L3 and so on are generated by repeating the above steps until no further frequent itemsets can be discovered.
Algorithm Apriori
Input: transaction database D, minimum support min_sup
Output: Lk, the frequent itemsets in D
1. find ST for each transaction in D
2. L1 = find_frequent_1_itemset(D)
3. L1 += get_txn_ids(D)                 // attach transaction ids to each item in L1
4. for (k=2; Lk-1 ≠ Φ; k++) {
5.   Ck = generate_candidate(Lk-1)
6.   x = item_min_sup(Ck, L1)           // for each candidate in Ck, the item with minimum support in L1
7.   target = get_txn_ids(x)            // ids of the transactions containing that item
8.   delete transactions with ST < k from D and from L1
9.   count each candidate c ∈ Ck over the transactions of target that remain in D
10.  Lk = {c ∈ Ck | c.count ≥ min_sup}
11. }
12. Answer = ∪k Lk
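A possible Python rendering of the approach described above is sketched below; the data structures (a tid-list per frequent item, a size table ST) follow the description, while the candidate generation is again a simplified self-join and all helper names are ours:

def improved_apriori(transactions, min_sup):
    # frequent itemsets are returned as a dict mapping frozenset -> support count
    db = {tid: set(items) for tid, items in enumerate(transactions, start=1)}
    st = {tid: len(items) for tid, items in db.items()}        # transaction sizes (ST)

    # L1: each frequent item mapped to the ids of transactions containing it
    tids = {}
    for tid, items in db.items():
        for item in items:
            tids.setdefault(frozenset([item]), set()).add(tid)
    l1 = {c: t for c, t in tids.items() if len(t) >= min_sup}

    frequent = {c: len(t) for c, t in l1.items()}
    current = l1
    k = 2
    while current:
        # Ck: simplified self-join of L(k-1)
        prev = list(current)
        candidates = {a | b for i, a in enumerate(prev)
                      for b in prev[i + 1:] if len(a | b) == k}

        # shrink the database: drop transactions with ST < k
        # and remove their ids from L1 as well
        for tid in [t for t in db if st[t] < k]:
            del db[tid]
        l1 = {c: t & db.keys() for c, t in l1.items()}

        next_level = {}
        for c in candidates:
            # item of the candidate with the smallest tid-list in L1
            least = min((frozenset([i]) for i in c),
                        key=lambda i: len(l1.get(i, ())))
            target = l1.get(least, set())
            sup_tids = {tid for tid in target if c <= db[tid]}   # scan only these transactions
            if len(sup_tids) >= min_sup:
                next_level[c] = sup_tids
                frequent[c] = len(sup_tids)
        current = next_level
        k += 1
    return frequent

On the 5-transaction table from the introduction with min_sup = 3, this produces the same frequent itemsets as the classical loop, but each pair is counted only over the tid-list of its less frequent item rather than over the whole database.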
5. EXPERIMENTAL EXAMPLE
The transaction database (D) used below contains 10 transactions and min_sup = 3. The size of each transaction (ST) is calculated (refer to Figure 3).
All transactions are scanned to obtain the frequent-1-itemset L1. It contains the items, their respective support counts and the transactions from D that contain them. Infrequent candidates, i.e. itemsets whose support < min_sup, are eliminated (refer to Figure 4 and Figure 5).
Fig.4. Candidate-1-itemset
Fig.5. Frequent-1-itemset
L2 is shown in Figure 8.
Fig.7. Frequent-1-itemset (updated)
Fig.8. Frequent-2-itemset
Fig.11. Frequent-3-itemset
6. COMPARATIVE ANALYSIS
We counted the number of transactions scanned to find L1, L2 and L3 for the given example; the figure below shows the difference in the number of transactions scanned by the original Apriori algorithm and by our proposed idea.
For k = 1, the number of transactions scanned is the same for classical Apriori and for our proposed idea, but as k increases, the number of transactions scanned decreases. Refer to the figure below.
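As a rough illustration of the metric being compared (our own sketch, not the authors' measurement code), the number of transactions read per pass can be tallied as follows, where targets_per_pass would hold, for each k >= 2, the set of transaction ids actually scanned after the ST and minimum-support-item filters:

def scans_classical(num_transactions, max_k):
    # classical Apriori reads every transaction once in each pass k
    return {k: num_transactions for k in range(1, max_k + 1)}

def scans_proposed(num_transactions, targets_per_pass):
    # proposed idea: pass 1 still reads the whole database;
    # for k >= 2 only the targeted transactions are read
    counts = {1: num_transactions}
    for k, tid_set in targets_per_pass.items():
        counts[k] = len(tid_set)
    return counts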
7. CONCLUSION
REFERENCES
1. J. Han and M. Kamber, Conception and Technology of Data Mining. Beijing: China Machine Press, 2007.
2. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery in databases," AI Magazine, vol. 17, no. 3, pp. 37-54, 1996.
3. S. Rao and R. Gupta, "Implementing improved algorithm over APRIORI data mining association rule algorithm," International Journal of Computer Science and Technology, pp. 489-493, Mar. 2012.
4. H. H. O. Nasereddin, "Stream data mining," International Journal of Web Applications, vol. 1, no. 4, pp. 183-190, 2009.
5. M. Halkidi, "Quality assessment and uncertainty handling in data mining process," in Proc. EDBT Conference, Konstanz, Germany, 2000.
6. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 20th VLDB Conference, Santiago, Chile, 1994, pp. 487-499.