AprioriTID Algorithm Improved From Apriori Algorithm
ISBN: 978-1-60595-383-0
ABSTRACT: Association rule mining is one of the most popular data mining technologies. This paper studies the AprioriTid algorithm for association rule mining and identifies its two main problems: the huge candidate TID lists and the storage of a large number of useless items. To address these problems, the paper proposes an improved AprioriTid algorithm based on transaction set compression and candidate item set compression. The algorithm optimizes the self-join of the frequent item sets and reduces the candidate item sets by deleting transactions and shrinking the data set. Tests and comparisons on a UCI standard test set show that the improved algorithm raises time efficiency by 10%-20%.
Keywords: data mining; association rule; AprioriTid Algorithm; improved AprioriTid Algorithm
Each row in this table corresponds to a transaction, which contains a unique identifier labeled TID and a set of items bought by a given customer. Retailers are interested in analyzing such data to learn about the purchasing behavior of their customers. This valuable information can be used to support a variety of business-related applications such as marketing promotions, inventory management, and customer relationship management. Association analysis is useful for discovering interesting relationships hidden in large datasets. The uncovered relationships can be represented in the form of association rules or sets of frequent items. For example, the following rule can be extracted from the data set shown in Table 1.

Table 1. An example of market basket transactions.
TID   Items
1     {Bread, Milk}
2     {Butter}
3     {Beer, Diaper}
4     {Bread, Milk, Butter}
5     {Bread}

The rule suggests that a strong relationship exists between the sale of butter, bread and milk: if butter and bread are bought, customers also buy milk. Retailers can use this type of rule to identify new opportunities for cross-selling their products to customers.
Besides market basket data, association analysis is also applicable to other application domains such as bioinformatics, medical diagnosis, web mining, and scientific data analysis. In the analysis of Earth science data, for example, the association patterns may reveal interesting connections among ocean, land, and atmospheric processes. Such information may help Earth scientists develop a better understanding of how the different elements of the Earth system interact with each other. Even though the techniques presented here are generally applicable to a wider variety of data sets, for illustrative purposes our discussion focuses mainly on market basket data. There are two key issues that need to be addressed when applying association analysis to market basket data. First, discovering patterns from a large transaction data set can be computationally expensive. Second, some of the discovered patterns are potentially spurious because they may happen simply by chance.

2.1 Basic concepts

Let I be the set of all items and D the set of all transactions in the database. Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form:

X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅

Every rule is composed of two different sets of items, also known as itemsets, X and Y, where X is called the antecedent or left-hand side (LHS) and Y the consequent or right-hand side (RHS).
To illustrate the concepts, we use a small example from the supermarket domain. The set of items is I = {Bread, Milk, Butter, Beer, Diaper}, and Table 2 shows a binary 0/1 representation of a small market basket database, where in each entry the value 1 means the presence of the item in the corresponding transaction and the value 0 represents the absence of the item in that transaction.

Table 2. A binary 0/1 representation of market basket data.
TID   Milk   Bread   Butter   Beer   Diaper
1     1      1       0        0      0
2     0      0       1        0      0
3     0      0       0        1      1
4     1      1       1        0      0
5     0      1       0        0      0

An example rule for the supermarket could be {Butter, Bread} ⇒ {Milk}, meaning that if butter and bread are bought, customers also buy milk.
In order to select interesting rules from the set of all possible rules, constraints on various measures of significance and interest are used. The best-known constraints are minimum thresholds on support and confidence.
Let X be an itemset, X ⇒ Y an association rule, and T the set of transactions of a given database.
Support: The support of X with respect to T, written supp(X), is defined as the proportion of transactions in the database which contain the itemset X. In the example database, the itemset {Milk, Bread, Butter} has a support of 1/5 = 0.2, since it occurs in 20% of all transactions (1 out of 5 transactions). The argument of supp(X) is a set of preconditions, and thus becomes more restrictive as it grows (rather than more inclusive).
Confidence: The confidence of a rule X ⇒ Y with respect to a set of transactions T is the proportion of the transactions containing X which also contain Y. Confidence is defined as:

conf(X ⇒ Y) = supp(X ∪ Y) / supp(X)
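As a concrete illustration (not part of the original paper; the class and variable names below are ours), the following Java sketch computes supp(X ∪ Y) and conf(X ⇒ Y) for the rule {Butter, Bread} ⇒ {Milk} over the five transactions of Table 1.

import java.util.*;

public class SupportConfidenceExample {
    // Proportion of transactions that contain every item of the given itemset.
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / transactions.size();
    }

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
            Set.of("Bread", "Milk"),
            Set.of("Butter"),
            Set.of("Beer", "Diaper"),
            Set.of("Bread", "Milk", "Butter"),
            Set.of("Bread"));

        Set<String> x = Set.of("Butter", "Bread");        // antecedent X
        Set<String> y = Set.of("Milk");                   // consequent Y
        Set<String> xy = new HashSet<>(x);
        xy.addAll(y);                                     // X ∪ Y

        double suppXY = support(db, xy);                  // 1/5 = 0.2
        double confidence = suppXY / support(db, x);      // 0.2 / 0.2 = 1.0
        System.out.println("supp(X u Y) = " + suppXY + ", conf(X => Y) = " + confidence);
    }
}

With these toy data the rule holds with confidence 1.0: every transaction containing butter and bread also contains milk.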
2.2 The algorithms

Association rule mining can be divided into two sub-tasks [2]:
(1) Find all the frequent itemsets in the data set D according to the minimum support.
(2) Generate association rules from the frequent itemsets according to the minimum confidence.
The first task, finding all the frequent itemsets in D, is the main problem and the standard benchmark of association rule research, and most studies focus on it. The Apriori algorithm [3,4] is a commonly used level-wise algorithm for association rule mining with good overall performance. Apriori needs multiple passes to find the frequent itemsets: the first pass finds the frequent 1-itemsets, and the process then repeats until no new frequent itemset is generated. In the k-th pass, the apriori-gen function generates the candidate k-itemsets Ck, the support of each candidate in Ck is counted over the database and compared with the minimum support, and the frequent k-itemsets Lk are obtained. The apriori-gen function consists of a join step and a prune step; the prune step reduces the number of entries in the candidate itemset Ck.
The main idea is as follows:

(1)  L1 = {large 1-itemsets};
(2)  for (k = 2; Lk-1 ≠ ∅; k++) do begin
(3)    Ck = apriori-gen(Lk-1);        // new candidates
(4)    for all transactions t ∈ D do begin
(5)      Ct = subset(Ck, t);          // candidates contained in transaction t
(6)      for all candidates c ∈ Ct do
(7)        c.count++;
(8)    end
(9)    Lk = {c ∈ Ck | c.count ≥ minsup}
(10) end
(11) Answer = ∪k Lk
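The join and prune steps of apriori-gen can be sketched in Java as follows. This is an illustrative sketch under the usual assumption that itemsets are stored as lexicographically sorted lists, not the paper's implementation; the names AprioriGen, aprioriGen and allSubsetsFrequent are ours.

import java.util.*;

public class AprioriGen {
    // Join step: combine two frequent (k-1)-itemsets that agree on their first k-2 items.
    // Prune step: drop a candidate if any of its (k-1)-subsets is not frequent.
    static Set<List<String>> aprioriGen(Set<List<String>> lPrev, int k) {
        Set<List<String>> candidates = new HashSet<>();
        for (List<String> a : lPrev) {
            for (List<String> b : lPrev) {
                if (a.subList(0, k - 2).equals(b.subList(0, k - 2))
                        && a.get(k - 2).compareTo(b.get(k - 2)) < 0) {
                    List<String> c = new ArrayList<>(a);
                    c.add(b.get(k - 2));                 // join: sorted k-itemset candidate
                    if (allSubsetsFrequent(c, lPrev)) {  // prune
                        candidates.add(c);
                    }
                }
            }
        }
        return candidates;
    }

    static boolean allSubsetsFrequent(List<String> candidate, Set<List<String>> lPrev) {
        for (int i = 0; i < candidate.size(); i++) {
            List<String> subset = new ArrayList<>(candidate);
            subset.remove(i);                            // one (k-1)-subset
            if (!lPrev.contains(subset)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Frequent 2-itemsets over {Bread, Butter, Milk}, items kept in sorted order.
        Set<List<String>> l2 = Set.of(
            List.of("Bread", "Butter"), List.of("Bread", "Milk"), List.of("Butter", "Milk"));
        System.out.println(aprioriGen(l2, 3));           // [[Bread, Butter, Milk]]
    }
}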
The AprioriTid algorithm [5] is the classic association rule algorithm improved from the Apriori algorithm. Apriori scans the data set D in every pass to count the support of the candidate itemsets, whereas AprioriTid uses D only in the first pass, when the support of the 1-itemsets is counted; in the later passes it replaces D with the set C̄k, which stores, for each transaction, the candidate k-itemsets contained in that transaction instead of the transaction itself, and this set is used to decide whether a candidate k-itemset is a frequent k-itemset. Every member of C̄k has the form <TID, {Xk}>, where each {Xk} is a potentially large k-itemset contained in the transaction identified by TID. The first step of the AprioriTid algorithm is to scan the data set D, count the support of the candidate 1-itemsets C1, and generate the frequent 1-itemsets L1; C̄1 corresponds to the data set D itself. The second step iterates until no new frequent itemsets are generated: every transaction t ∈ D is recorded as <t.TID, {c ∈ Ck | c ⊆ t}> in C̄k. In the k-th step, the frequent (k-1)-itemsets Lk-1 obtained in step k-1 are used to generate the candidate k-itemsets Ck; then C̄k-1 is traversed to count the support of the candidates in Ck and, at the same time, to generate the frequent itemsets Lk. As k increases, C̄k becomes far smaller than the transaction data set D, which reduces the I/O time and the size of the data set that must be scanned and improves the efficiency of the algorithm.
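A minimal Java sketch of the C̄k idea is given below; it is our own illustration (the names TidEntry and nextCBar are hypothetical), and the containment test is simplified to an item-union check rather than the generator-based test of the original AprioriTid formulation.

import java.util.*;

public class AprioriTidPass {
    // One entry of C̄k: a transaction ID plus the candidate k-itemsets it contains.
    record TidEntry(int tid, Set<Set<String>> itemsets) {}

    // Build C̄k and count candidate supports from C̄(k-1), without touching the raw data set D.
    static List<TidEntry> nextCBar(List<TidEntry> cBarPrev,
                                   Set<Set<String>> candidates,
                                   Map<Set<String>, Integer> supportCount) {
        List<TidEntry> cBarK = new ArrayList<>();
        for (TidEntry entry : cBarPrev) {
            // Union of the items the transaction is still known to contain.
            Set<String> itemsInTransaction = new HashSet<>();
            entry.itemsets().forEach(itemsInTransaction::addAll);

            Set<Set<String>> contained = new HashSet<>();
            for (Set<String> c : candidates) {
                if (itemsInTransaction.containsAll(c)) {
                    contained.add(c);
                    supportCount.merge(c, 1, Integer::sum);      // c.count++
                }
            }
            if (!contained.isEmpty()) {
                cBarK.add(new TidEntry(entry.tid(), contained)); // <TID, {c ∈ Ck | c ⊆ t}>
            }
        }
        return cBarK;                                            // entries with no candidate are dropped
    }
}

Because only entries that still contain at least one candidate are kept, C̄k shrinks quickly as k grows, which is exactly the effect described above.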
3 OPTIMIZED APRIORITID ALGORITHM

3.1 Features of the optimized AprioriTid algorithm

Property 1: All the nonempty subsets of any frequent itemset are frequent itemsets, and any superset of a non-frequent itemset is a non-frequent itemset [6].
Property 2: If the frequent k-itemsets can generate (k+1)-itemsets, then the number of frequent k-itemsets must be greater than k. It follows from Property 1 that all the k-subsets of any (k+1)-itemset in Lk+1 must belong to the frequent k-itemsets.
Property 3: Any transaction that supports a frequent itemset in Lk supports at least k of the (k-1)-itemsets in Lk-1.
Property 4: The non-frequent items in the AprioriTid transaction data C̄k can be discarded when computing Lk+1. By contradiction: if a non-frequent item could not be discarded, it would have to support Lk+1, which contradicts Property 1.

3.2 The optimization idea of AprioriTid

(1) The improved algorithm based on transaction set compression [7]. In the k-th step, while the candidate itemsets Ck (c ∈ Ck) are matched against C̄k-1, any transaction record whose number of potentially large itemsets is less than or equal to 1 is deleted directly.
Justification: in the k-th step, for every c ∈ Ck, while C̄k-1 is traversed and C̄k is generated, each transaction record of C̄k-1 must be checked for whether it contains the two (k-1)-itemsets obtained from c by removing its k-th and its (k-1)-th item; therefore every transaction record that can still contribute must contain at least two candidate (k-1)-itemsets.
(2) The improved algorithm based on candidate itemset compression [8]. Before mining, the algorithm preprocesses the original data and builds a data dictionary so that attribute values are classified and mapped to simple numbers. In this paper the data set is sorted in ascending dictionary order in order to reduce the candidate itemsets: when generating the candidate k-itemsets, only the itemsets whose support is not lower than the minimum support are kept, and for k ≥ 2 the candidate k TID itemsets are stored in the table as sets rather than as raw item lists (in the form <t.TID, {c ∈ Lk-1 | c ⊆ t}>), which reduces the data storage. After obtaining the frequent itemset Lk, the occurrences of each single item are counted; if an item's count is less than minsup, it is removed, and a new frequent itemset L'k replaces the original Lk, which reduces the number of candidate itemsets generated in the next pass as well as the storage space and time.
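One possible reading of the two compression steps is sketched in Java below; the helper names compressTransactions and pruneFrequentItemsets are ours, and in particular the construction of L'k (dropping itemsets that contain an insufficiently frequent item) is our interpretation of the description above, not code from the paper.

import java.util.*;

public class CompressionSteps {
    record TidEntry(int tid, Set<Set<String>> itemsets) {}

    // (1) Transaction set compression: an entry with <= 1 potentially large itemset
    //     can never contain both (k-1)-generators of a k-candidate, so drop it.
    static List<TidEntry> compressTransactions(List<TidEntry> cBarPrev) {
        List<TidEntry> kept = new ArrayList<>();
        for (TidEntry e : cBarPrev) {
            if (e.itemsets().size() > 1) kept.add(e);
        }
        return kept;
    }

    // (2) Candidate itemset compression: count single items inside Lk and remove
    //     every itemset containing an item whose count is below minSupCount,
    //     producing the reduced set L'k used for the next candidate generation.
    static Set<Set<String>> pruneFrequentItemsets(Set<Set<String>> lk, int minSupCount) {
        Map<String, Integer> itemCount = new HashMap<>();
        for (Set<String> itemset : lk) {
            for (String item : itemset) itemCount.merge(item, 1, Integer::sum);
        }
        Set<Set<String>> lkPrime = new HashSet<>();
        for (Set<String> itemset : lk) {
            boolean allItemsFrequentEnough =
                itemset.stream().allMatch(i -> itemCount.get(i) >= minSupCount);
            if (allItemsFrequentEnough) lkPrime.add(itemset);
        }
        return lkPrime;
    }
}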
3.3 Improved AprioriTid Algorithm pseudocode

The improved algorithm is as follows (steps (7)-(16) are executed for every transaction record t in C̄k-1):

(1)  C̄1 = D
(2)  L1 = getFreq1Itemset(C̄1)
(3)  L'1 = L1
(4)  for (k = 2; (L'k-1 ≠ ∅) && (C̄k-1 ≠ ∅); k++) {
(5)    Ck = Apriori_gen(L'k-1)
(6)    C̄k = ∅
(7)    if (the record of transaction t contains fewer than k itemsets)
(8)      continue
(9)    else if (c ∈ Ck) {
(10)     if (c ⊆ t) {
(11)       Ct = Ct ∪ {c}
(12)       c.sup++
(13)     }
(14)   }
(15)   if ((Ct ≠ ∅) && (the number of itemsets in Ct ≥ k))
(16)     C̄k = C̄k ∪ {<t.TID, Ct>}
(17)   Lk = {c ∈ Ck | c.sup ≥ minsup}
(18)   L'k = getFreqKItemSet(Lk)
(19) }
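For readability, the body of the improved main loop (steps (6)-(16)), with the two compression checks in place, can be rendered in Java as the sketch below; the names TidEntry and processPass are ours, and the thresholds follow the reading of the pseudocode given above rather than a verified implementation.

import java.util.*;

public class ImprovedAprioriTidLoop {
    record TidEntry(int tid, Set<Set<String>> itemsets) {}

    // One pass k of the improved algorithm: scan C̄(k-1), skip compressed-away
    // records, count candidate supports, and build C̄k from the surviving records.
    static List<TidEntry> processPass(int k,
                                      List<TidEntry> cBarPrev,
                                      Set<Set<String>> candidates,          // Ck
                                      Map<Set<String>, Integer> support) {  // c.sup
        List<TidEntry> cBarK = new ArrayList<>();                           // step (6)
        for (TidEntry t : cBarPrev) {
            if (t.itemsets().size() < k) continue;                          // steps (7)-(8)

            Set<String> itemsOfT = new HashSet<>();
            t.itemsets().forEach(itemsOfT::addAll);

            Set<Set<String>> ct = new HashSet<>();                          // Ct
            for (Set<String> c : candidates) {                              // step (9)
                if (itemsOfT.containsAll(c)) {                              // step (10): c ⊆ t
                    ct.add(c);                                              // step (11)
                    support.merge(c, 1, Integer::sum);                      // step (12)
                }
            }
            if (!ct.isEmpty() && ct.size() >= k) {                          // step (15)
                cBarK.add(new TidEntry(t.tid(), ct));                       // step (16)
            }
        }
        return cBarK;
    }
}

Steps (17)-(18) would then keep the candidates whose support reaches minsup to form Lk and apply the candidate itemset compression to obtain L'k.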
4 EXPERIMENT AND THE RESULT

To verify the efficiency of the improved algorithm, the experiment compares the frequent-itemset generation time of the improved algorithm (named NewAprioriTid) with that of the AprioriTid algorithm under different supports. To better verify the efficiency of NewAprioriTid, we use a UCI standard test set, the Mushroom data set. The UCI test data are popular in data mining, freely available on the Internet, and widely accepted as authoritative. The experimental environment is the Windows 7 operating system, a 2.50 GHz CPU, 4.0 GB of memory, and Java as the development language.

4.1 Comparison between the original algorithm and the improved algorithm, and analysis

(1) Fix the number of transactions in the data set and compare the original algorithm and the improved algorithm under different support thresholds. 4,000 transactions are chosen randomly from the transactional database, and the resulting running times are given in Table 3.

Table 3. The run time of 4000 transactions.
Support                               0.5    0.55   0.6    0.65   0.7
The run time of AprioriTid (ms)       340    330    304    283    261
The run time of NewAprioriTid (ms)    313    290    270    242    214
Improvement efficiency                8%     12%    11%    14%    18%

(The improvement efficiency is the relative saving (t_AprioriTid − t_NewAprioriTid) / t_AprioriTid; for example, (340 − 313)/340 ≈ 8%.)

To compare the running efficiency more clearly, this paper uses Figure 1 as an illustration.

Figure 1. The run time of 4000 transactions (run time in ms versus support, for AprioriTid and NewAprioriTid).

It can be seen from the table above that the behavior of the two algorithms is relatively stable: the running time decreases as the support increases. From the time-saving curve we can see that the improved algorithm takes less time than the original. Moreover, when the support is relatively low, the time taken by the improved algorithm is reduced by 10%-20%, and with the increase of the threshold value the time-saving efficiency of the improved algorithm becomes more apparent. At the beginning of the run, a lower support threshold means that more frequent itemsets and association rules satisfy the condition, so the running time is relatively long. As the support increases, the variation of the running time becomes smaller. This shows the superiority of the improved algorithm: a high support at the start suppresses the generation of the large number of candidate itemsets and frequent itemsets that only satisfy lower supports, which saves a lot of time and makes the variation larger.
(2) To compare the efficiency under different numbers of transactions, take the minimum support 0.5 and randomly take 5000, 6000, 7000 and 8000 transactions for comparison and analysis. The results are given in Table 4.

Table 4. The efficiency of different transaction sets.
The number of transactions            5000   6000   7000   8000
The run time of AprioriTid (ms)       263    291    326    354
The run time of NewAprioriTid (ms)    227    258    283    314

For better comparison, the data are also shown in Figure 2.

Figure 2. The run time of different numbers of transactions (run time in ms versus transaction number, for AprioriTid and NewAprioriTid).

Figure 2 shows the run time for different numbers of transactions with the support fixed at 0.5. It shows that the general tendencies of the two algorithms are the same; meanwhile, the improved algorithm has an advantage in the rate at which the run time increases.
5 CONCLUSION

The association rule mining problem can be divided into two sub-tasks: finding the frequent itemsets and generating the association rules. Finding the frequent itemsets is the core of association rule mining. This paper proposed an improved AprioriTid algorithm based on transaction set compression and candidate item set compression. On the one hand, it compresses the size of the data set efficiently and reduces the number of scans of the data set; on the other hand, it reduces the candidate itemsets and improves the algorithm. The experimental results show that the improved AprioriTid algorithm performs better than the original one.

REFERENCES

[1] Han J., Kamber M., Pei J. 2011. Data Mining: Concepts and Techniques. Elsevier.
[2] Agrawal R., Mannila H., Srikant R., et al. 1996. Fast discovery of association rules. Advances in Knowledge Discovery and Data Mining, 12(1): 307-328.
[3] Park J.S., Chen M.S., Yu P.S. 1995. An effective hash-based algorithm for mining association rules. ACM SIGMOD Record, 24(2): 175-186.
[4] Pasquier N., Bastide Y., Taouil R., et al. 1999. Discov-