
An Approach of Improvisation in Efficiency of Apriori Algorithm

Sakshi Aggarwal1, Ritu Sindhu2

1 SGT Institute of Engineering & Technology, Gurgaon, Haryana
[email protected]
2 SGT Institute of Engineering & Technology, Gurgaon, Haryana
[email protected]

Abstract. Association rule mining has great importance in data mining, and Apriori is its key algorithm. Many approaches have been proposed in the past to improve Apriori, but the core concept of the algorithm, the support and confidence of itemsets, remains the same, and previous studies find that classical Apriori is inefficient because of its many scans over the database. In this paper, we propose a method to improve the efficiency of the Apriori algorithm by reducing the database size as well as the time spent scanning transactions.

Keywords: Apriori algorithm, Support, Frequent Itemset, Association Rules, Candidate Itemsets.

1. INTRODUCTION

Extracting relevant information by exploiting data is called data mining. Business people have an increasing need to extract valid and useful information from large datasets [2], and this is where data mining achieves its goal. Data mining is thus important for discovering hidden patterns in the huge volumes of data stored in databases, OLAP (Online Analytical Processing) systems, data warehouses, etc. [5]. For this reason, data mining is also known as KDD (Knowledge Discovery in Databases) [4]. KDD techniques are used to extract interesting patterns. The steps of the KDD process are data cleaning, selection of relevant data, data transformation, data pre-processing, mining, and pattern evaluation.

2. ASSOCIATION RULE MINING

Association rule mining is important in the fields of artificial intelligence, information science, databases, and many others. Data volumes are increasing dramatically through day-to-day activities. Mining association rules from massive data is therefore of interest to many industries, as these rules help in decision-making processes, market basket analysis, cross marketing, etc.

PeerJ PrePrints | https://dx.doi.org/10.7287/peerj.preprints.1159v1 | CC-BY 4.0 Open Access | rec: 5 Jun 2015, publ: 5 Jun 2015
Association rule problems have been discussed since 1993, and many researchers have worked on optimizing the original algorithm, for example by random sampling, declining rules, or changing the storage framework [1]. We mine association rules from huge amounts of data to identify relationships among items; these relationships describe how people buy sets of items, since there is usually a particular pattern that customers follow when buying a set of items.
In data mining, association rule mining finds unknown dependencies in the data and then derives rules between the items [3]. The association rule mining problem is defined as follows.
DBT = {T1, T2, ..., TN} is a database of N transactions.
Each transaction is a set of items drawn from I = {i1, i2, i3, ..., iM}, the set of all items. An association rule is of the form A ⇒ B, where A and B are itemsets, A ⊆ I, B ⊆ I, and A ∩ B = ∅. The goal of the algorithm is to extract useful information from these transactions.
For example, consider the table below, which contains some transactions:

Table 1. Example of transactions in a database

TID Items
1 CPU, Monitor
2 CPU, Keyboard, Mouse, UPS
3 Monitor, Keyboard, Mouse, Motherboard
4 CPU, Monitor, Keyboard, Mouse
5 CPU, Monitor, Keyboard, Motherboard

Examples of Association Rules:

{Keyboard} ⇒ {Mouse},
{CPU, Monitor} ⇒ {UPS, Motherboard},
{CPU, Mouse} ⇒ {Monitor}.
A ⇒ B is an association rule (A and B are itemsets).
Example: {Monitor, Keyboard} ⇒ {Mouse}

Rule Evaluation:
Support: the rate of occurrence of an itemset in the transaction database.
Support (Keyboard ⇒ Mouse) =
(No. of transactions containing both Keyboard and Mouse) / (No. of total transactions)

Confidence: for a rule X ⇒ Y, the fraction of the transactions containing X that also contain Y.
Confidence (Keyboard ⇒ Mouse) =
(No. of transactions containing both Keyboard and Mouse) / (No. of transactions containing Keyboard)
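As a quick illustration, the two measures can be computed directly from the transactions of Table 1; the sketch below uses absolute counts and turns them into ratios exactly as defined above.

```python
# Support and confidence computed over the transactions of Table 1.
transactions = [
    {"CPU", "Monitor"},
    {"CPU", "Keyboard", "Mouse", "UPS"},
    {"Monitor", "Keyboard", "Mouse", "Motherboard"},
    {"CPU", "Monitor", "Keyboard", "Mouse"},
    {"CPU", "Monitor", "Keyboard", "Motherboard"},
]

def support(itemset, db):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent` (computed from counts)."""
    both = sum((antecedent | consequent) <= t for t in db)
    ante = sum(antecedent <= t for t in db)
    return both / ante

print(support({"Keyboard", "Mouse"}, transactions))       # 3/5 = 0.6
print(confidence({"Keyboard"}, {"Mouse"}, transactions))  # 3/4 = 0.75
```

Here Keyboard and Mouse occur together in transactions 2, 3, and 4, while Keyboard alone occurs in four transactions, which yields the two ratios printed above.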


Itemset: A collection of one or more items is called an itemset, e.g. {Monitor, Keyboard, Mouse}. A k-itemset contains k items.

Frequent Itemset: An itemset I is frequent if

SI >= min_sup

where min_sup is the minimum support threshold and SI denotes the support of itemset I.

3. CLASSICAL APRIORI ALGORITHM

The Apriori algorithm uses an iterative approach: in each iteration, it generates candidate itemsets from the large itemsets of the previous iteration [2]. The basic concept of this iterative approach is as follows:

Algorithm Apriori(D, min_sup)
1. L1 = {frequent 1-itemsets};
2. for (k=2; Lk-1 ≠ ∅; k++) {
3.   Ck = generate_Apriori(Lk-1);  // new candidates
4.   forall transactions t ∈ D do begin
5.     Ct = subset(Ck, t);  // candidates contained in t
6.     forall candidates c ∈ Ct do
7.       c.count++;
8.   end
9.   Lk = {c ∈ Ck | c.count ≥ min_sup}
10. }
11. Answer = ∪k Lk
Algorithm. 1. Apriori Algorithm [6]

The algorithm above is the classical Apriori algorithm. First, the database is scanned to find the frequent 1-itemsets, counting the occurrences of each item: L1 is created from the candidate 1-itemsets by keeping each item that satisfies the minimum support. In each subsequent iteration, the itemsets found in the previous iteration are used as a seed to generate the next set of candidate itemsets (candidate generation) via the generate_Apriori function.
generate_Apriori takes Lk-1 as input and returns Ck. The join step joins Lk-1 with itself, and the prune step deletes every candidate c ∈ Ck that has a (k-1)-subset not contained in Lk-1.

Algorithm generate_Apriori(Lk-1)
1. insert into Ck
2. select p.I1, p.I2, ..., p.Ik-1, q.Ik-1
   from Lk-1 p, Lk-1 q
   where p.I1 = q.I1, ..., p.Ik-2 = q.Ik-2, p.Ik-1 < q.Ik-1;
3. forall itemsets c ∈ Ck do
4.   forall (k-1)-subsets s of c do
5.     if (s ∉ Lk-1) then
6.       delete c from Ck
Algorithm. 2. Apriori-Gen Algorithm [6]
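For concreteness, a minimal runnable Python sketch of Algorithms 1 and 2 follows. Function names mirror the pseudocode; the set-based join and the `frozenset` representation are implementation choices, not prescribed by the paper.

```python
from itertools import combinations

def generate_apriori(L_prev, k):
    """Apriori-gen: join L(k-1) with itself, then prune every candidate
    that contains an infrequent (k-1)-subset."""
    joined = {p | q for p in L_prev for q in L_prev if len(p | q) == k}
    return {c for c in joined
            if all(frozenset(s) in L_prev for s in combinations(c, k - 1))}

def apriori(db, min_sup):
    """Classical Apriori: return every itemset whose absolute support
    in `db` (a list of sets) is at least `min_sup`."""
    items = {i for t in db for i in t}
    L = {frozenset([i]) for i in items if sum(i in t for t in db) >= min_sup}
    answer = set(L)
    k = 2
    while L:
        Ck = generate_apriori(L, k)                        # candidate generation
        counts = {c: sum(c <= t for t in db) for c in Ck}  # one full DB scan per k
        L = {c for c, n in counts.items() if n >= min_sup}
        answer |= L
        k += 1
    return answer

# Applied to Table 1 with min_sup = 3: four frequent 1-itemsets and four
# frequent 2-itemsets survive; no 3-itemset reaches the threshold.
table1 = [
    {"CPU", "Monitor"},
    {"CPU", "Keyboard", "Mouse", "UPS"},
    {"Monitor", "Keyboard", "Mouse", "Motherboard"},
    {"CPU", "Monitor", "Keyboard", "Mouse"},
    {"CPU", "Monitor", "Keyboard", "Motherboard"},
]
print(len(apriori(table1, 3)))  # 8
```

Note how every candidate set Ck is counted against the full database; this repeated scanning is exactly the inefficiency discussed in the next subsection.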

Fig.1. Apriori Algorithm Steps


3.1. Limitations of Apriori Algorithm

• A large number of candidate and frequent itemsets must be handled, which increases cost and wastes time.
  Example: if the number of frequent (k-1)-itemsets is 10^4, then almost 10^7 candidate itemsets Ck need to be generated and tested [2], so the database is scanned many times to find Ck.
• Apriori is inefficient in terms of memory requirements when large numbers of transactions are under consideration.

4. PROPOSED ENHANCEMENT IN EXISTING APRIORI ALGORITHM

The section below presents an idea to improve Apriori's efficiency, along with an example and an algorithm.

4.1. Improvement of Apriori

In this approach to improving the efficiency of the Apriori algorithm, we focus on reducing the time consumed in generating Ck.
To find the frequent itemsets, the size of each transaction (ST) in the database is first computed and maintained. Then L1 is found, containing the set of items, the support value of each item, and the transaction IDs containing each item. L1 is used to generate L2, L3, ..., while the database size is decreased so that the time needed to scan transactions is reduced.
To generate C2(x, y) (where x and y are the items of a candidate), perform L(k-1) * L(k-1). To find L2 from C2, instead of scanning the complete database and all its transactions, we remove every transaction where ST < k (with k = 2, 3, ...) and also remove the deleted transactions from L1. This reduces the time spent scanning infrequent transactions.
Next, find the item with the minimum support among x and y, and get that item's transaction IDs from L1. Ck is then scanned only over those specific transactions, and over the reduced database. L2 is generated from C2 by keeping the candidates whose support >= min_sup.
C3(x, y, z), L3, and so on are generated by repeating the above steps until no more frequent itemsets can be discovered.

Algorithm Apriori
Input: transaction database D; minimum support min_sup
Output: Lk, the frequent itemsets in D
1. find ST  // for each transaction in D
2. L1 = find_frequent_1_itemsets(D)
3. L1 += get_txn_ids(D)  // attach transaction ids to each item
4. for (k=2; Lk-1 ≠ ∅; k++) {
5.   Ck = generate_candidate(Lk-1)
6.   x = item_min_sup(Ck, L1)  // item of Ck(a, b) with minimum support, using L1
7.   target = get_txn_ids(x)  // transactions to scan for each candidate
8.   foreach (txn t in target) do {
9.     Ck.count++
10.    Lk = {items in Ck with count >= min_sup}
11.  }  // end foreach
12.  foreach (txn in D) {
13.    if (ST = (k-1))
14.      txn_set += txn
15.  }  // end foreach
16.  delete_txn_DB(txn_set)  // reduce DB size
17.  delete_txn_L1(txn_set, L1)  // reduce transaction lists in L1
18. }  // end for
Algorithm. 3. Proposed Apriori Algorithm
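The steps above can be sketched in runnable Python as follows. This is one interpretation of Algorithm 3, a sketch rather than a definitive implementation: the named helpers of the pseudocode (get_txn_ids, delete_txn_DB, etc.) are inlined, and the per-item transaction-id lists stand in for the augmented L1.

```python
def improved_apriori(db, min_sup):
    """Proposed variant: keep transaction-id lists alongside L1, drop
    transactions shorter than k before round k, and count each candidate
    only over the tids of its least-frequent item."""
    db = {tid: frozenset(t) for tid, t in enumerate(db, start=1)}
    tids = {}  # item -> set of transaction ids containing it (the L1 lists)
    for tid, t in db.items():
        for i in t:
            tids.setdefault(i, set()).add(tid)
    tids = {i: s for i, s in tids.items() if len(s) >= min_sup}  # frequent items
    frequent = {frozenset([i]) for i in tids}
    L = set(frequent)
    k = 2
    while L:
        short = {tid for tid, t in db.items() if len(t) < k}  # ST < k
        for tid in short:                                     # reduce DB size
            del db[tid]
        for i in tids:                                        # prune L1 tid lists
            tids[i] -= short
        Ck = {p | q for p in L for q in L if len(p | q) == k}  # candidates
        L = set()
        for c in Ck:
            x = min(c, key=lambda i: len(tids[i]))         # least-frequent item
            count = sum(c <= db[tid] for tid in tids[x])   # scan only its tids
            if count >= min_sup:
                L.add(c)
        frequent |= L
        k += 1
    return frequent

table1 = [
    {"CPU", "Monitor"},
    {"CPU", "Keyboard", "Mouse", "UPS"},
    {"Monitor", "Keyboard", "Mouse", "Motherboard"},
    {"CPU", "Monitor", "Keyboard", "Mouse"},
    {"CPU", "Monitor", "Keyboard", "Motherboard"},
]
print(len(improved_apriori(table1, 3)))  # 8
```

On the Table 1 transactions with min_sup = 3 this returns the same eight frequent itemsets as classical Apriori, while each candidate is checked against only the transactions of its least-frequent item in a shrinking database.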


Fig.2. Proposed Apriori Algorithm Steps

5. EXPERIMENTAL EXAMPLE

Below is the transaction database (D), which has 10 transactions, with min_sup = 3. The size of each transaction (ST) is calculated (refer to Figure 3).

Fig.3. Transaction Database

All the transactions are scanned to obtain the frequent-1-itemset, L1. It contains the items, their respective support counts, and the transactions of D that contain each item. Infrequent candidates, i.e. itemsets whose support < min_sup, are eliminated. (Refer to Figure 4 and Figure 5.)

Fig.4. Candidate-1-itemset


Fig.5. Frequent-1-itemset

From L1, the frequent-2-itemset (L2) is generated as follows. Example: consider the itemset {I1, I2}. In classical Apriori, all transactions are scanned to find {I1, I2} in D. In our proposed idea, firstly, transaction T9 is deleted from D as well as from L1, because ST for T9 is less than k (k = 2). The new D and L1 are shown in Figure 6 and Figure 7 respectively. Secondly, {I1, I2} is split into {I1} and {I2}, and the item with minimum support, i.e. {I1}, is selected using L1; its transaction list is then used to build L2. So {I1, I2} is searched only in the transactions that contain {I1}, i.e. T1, T3, T7, T10.

So searching time is reduced in two ways:
• by reducing the database size
• by cutting down the number of transactions to be scanned.

L2 is shown in Figure 8.

Fig.6. Transaction Database (updated)

Fig.7. Frequent-1-itemset (updated)

Fig.8. Frequent-2-itemset

To generate the frequent-3-itemset (L3), D is updated by deleting transactions T6 and T10, as ST for these transactions is less than k (k = 3). L1 is also updated by deleting transactions T6 and T10. Then, repeating the above process, L3 is generated and infrequent itemsets are deleted. Refer to Figure 9, Figure 10, and Figure 11 for the updated database, L1, and L3 respectively.

Fig.9. Transaction Database (updated)


Fig.10. Frequent-1-itemset (updated)

Fig.11. Frequent-3-itemset

The above process is followed to find the frequent-k-itemsets for a given transaction database. From the frequent-k-itemsets, association rules are generated from their non-empty subsets that satisfy the minimum confidence value.
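The rule-generation step just described can be sketched as follows. Here `frequents`, a map from each frequent itemset to its support count, is assumed to have been produced by the mining phase; the example counts are the Keyboard/Mouse figures from Table 1, used for illustration.

```python
from itertools import combinations

def generate_rules(frequents, min_conf):
    """For each frequent itemset F, emit A => (F - A) for every non-empty
    proper subset A whose confidence meets min_conf."""
    rules = []
    for itemset, sup in frequents.items():
        for r in range(1, len(itemset)):  # all non-empty proper subset sizes
            for a in map(frozenset, combinations(itemset, r)):
                conf = sup / frequents[a]  # support(F) / support(A)
                if conf >= min_conf:
                    rules.append((a, itemset - a, conf))
    return rules

# Support counts for {Keyboard}, {Mouse}, {Keyboard, Mouse} from Table 1.
frequents = {
    frozenset({"Keyboard"}): 4,
    frozenset({"Mouse"}): 3,
    frozenset({"Keyboard", "Mouse"}): 3,
}
for a, b, conf in generate_rules(frequents, 0.7):
    print(set(a), "=>", set(b), conf)
```

With min_conf = 0.7 this yields Keyboard ⇒ Mouse (confidence 0.75) and Mouse ⇒ Keyboard (confidence 1.0).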

6. COMPARATIVE ANALYSIS

We counted the number of transactions scanned to find L1, L2, and L3 in our example; the figure below shows the difference between the counts of transactions scanned by the original Apriori algorithm and by our proposed idea.

Fig.12. Comparative Results

For k = 1, the number of transactions scanned is the same for classical Apriori and our proposed idea, but as k increases, the count of scanned transactions decreases. Refer to the figure below.

Fig.13. Comparative Analysis

7. CONCLUSION

We have proposed an idea to improve the efficiency of the Apriori algorithm by reducing the time taken to scan database transactions. We find that as k increases, the number of transactions scanned decreases, and thus the time consumed also decreases in comparison to the classical Apriori algorithm. As a result, the time taken to generate candidate itemsets with our idea also decreases in comparison to classical Apriori.

REFERENCES

1. J. Han and M. Kamber, Conception and Technology of Data Mining, Beijing: China Machine Press, 2007.
2. U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery in databases," AI Magazine, vol. 17, no. 3, pp. 37-54, 1996.
3. S. Rao and R. Gupta, "Implementing improved algorithm over APRIORI data mining association rule algorithm," International Journal of Computer Science and Technology, pp. 489-493, Mar. 2012.
4. H. H. O. Nasereddin, "Stream data mining," International Journal of Web Applications, vol. 1, no. 4, pp. 183-190, 2009.
5. M. Halkidi, "Quality assessment and uncertainty handling in data mining process," in Proc. EDBT Conference, Konstanz, Germany, 2000.
6. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. VLDB Conference, Santiago, Chile, 1994, pp. 487-499.
