
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1

A Transaction Mapping Algorithm for Frequent Itemsets Mining

Mingjun Song and Sanguthevar Rajasekaran, Member, IEEE

Abstract—In this paper, we present a novel algorithm for mining complete frequent itemsets. This algorithm is referred to as the TM (Transaction Mapping) algorithm from hereon. In this algorithm, the transaction ids of each itemset are mapped and compressed to continuous transaction intervals in a different space, and the counting of itemsets is performed by intersecting these interval lists in a depth-first order along the lexicographic tree. When the compression coefficient becomes smaller than the average number of comparisons for interval intersection at a certain level, the algorithm switches to transaction id intersection. We have evaluated the algorithm against two popular frequent itemset mining algorithms, FP-growth and dEclat, using a variety of data sets with short and long frequent patterns. Experimental data show that the TM algorithm outperforms these two algorithms.

Index Terms—Algorithms, Association Rule Mining, Data Mining, Frequent Itemsets.

M. Song is with the Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269. Email: [email protected].
S. Rajasekaran is with the Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269. Email: [email protected].
Manuscript received December 14, 2004; revised October 5, 2005.

I. INTRODUCTION

Association rules mining is a very popular data mining technique that finds relationships among the different entities of records (for example, transaction records). Since the introduction of frequent itemsets in 1993 by Agrawal et al. [1], the problem has received a great deal of attention in the field of knowledge discovery and data mining.

One of the first algorithms proposed for association rules mining was the AIS algorithm [1]; the problem of association rules mining was introduced in [1] as well. This algorithm was later improved to obtain the Apriori algorithm [2]. The Apriori algorithm employs the downward closure property: if an itemset is not frequent, no superset of it can be frequent either. The Apriori algorithm performs a breadth-first search in the search space by generating candidate (k+1)-itemsets from frequent k-itemsets. The frequency of an itemset is computed by counting its occurrences in the transactions. Many variants of the Apriori algorithm have been developed, such as AprioriTid, AprioriHybrid, direct hashing and pruning (DHP), dynamic itemset counting (DIC), the Partition algorithm, etc. For a survey of association rules mining algorithms, please see [3].

FP-growth [4] is a well-known algorithm that uses the FP-tree data structure to achieve a condensed representation of the database transactions, and employs a divide-and-conquer approach to decompose the mining problem into a set of smaller problems. In essence, it mines all the frequent itemsets by recursively finding all frequent 1-itemsets in the conditional pattern base, which is efficiently constructed with the help of a node link structure. A variant of FP-growth is the H-mine algorithm [5]. It uses array-based and trie-based data structures to deal with sparse and dense datasets, respectively. PatriciaMine [6] employs a compressed Patricia trie to store the datasets. FPgrowth* [7] uses an array technique to reduce the FP-tree traversal time. In FP-growth based algorithms, the recursive construction of the FP-tree affects the algorithm's performance.

Eclat [8] is the first algorithm to find frequent patterns by a depth-first search, and it has been shown to perform well. It uses a vertical database representation and counts the itemset supports using the intersection of tids. However, because of the depth-first search, the pruning used in the Apriori algorithm is not applicable during candidate itemset generation. VIPER [9] and Mafia [10] also use the vertical database layout and intersection to achieve good performance. The only difference is that they use compressed bitmaps to represent the transaction list of each itemset. However, their compression scheme has limitations, especially when tids are uniformly distributed. Zaki and Gouda [11] developed a new approach called dEclat using the vertical database representation.
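As a concrete illustration of the vertical tid-intersection idea used by Eclat and its descendants, the following sketch is ours (not code from any of the cited implementations); the tidsets are those of Table II below.

```python
# Sketch of Eclat-style support counting on a vertical layout.

def support(tidset_x, tidset_y):
    """Support of itemset X ∪ Y = size of the intersection of the tidsets."""
    return len(tidset_x & tidset_y)

# Vertical layout: item -> set of transaction ids containing it
vertical = {
    1: {1, 3, 4, 5},
    2: {1, 2, 5, 6},
    3: {1, 2, 4, 5},
}

# Support of {1, 3}: transactions containing both item 1 and item 3
print(support(vertical[1], vertical[3]))  # -> 3 (tids 1, 4, 5)
```

dEclat keeps the same search but stores set differences (diffsets) rather than the intersections themselves, as described next.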

They store the difference of tids, called the diffset, between a candidate k-itemset and its prefix (k−1)-frequent itemset, instead of the tids intersection set, denoted here as tidset. They compute the support by subtracting the cardinality of the diffset from the support of the prefix (k−1)-frequent itemset. This algorithm has been shown to gain significant performance improvements over Eclat. However, when the database is sparse, diffset loses its advantage over tidset.

In this paper, we present a novel approach that maps and compresses the transaction id list of each itemset into an interval list using a transaction tree, and counts the support of each itemset by intersecting these interval lists. The frequent itemsets are found in a depth-first order along a lexicographic tree, as done in the Eclat algorithm. The basic idea is to save the intersection time in Eclat by mapping transaction ids into continuous transaction intervals. When these intervals become scattered, we switch to transaction ids as in Eclat. We call the new algorithm the TM (transaction mapping) algorithm.

The rest of the paper is arranged as follows: Section II introduces the basic concepts of association rules mining, two types of data representation, and the lexicographic tree used in our algorithm; Section III addresses how the transaction id list of each itemset is compressed to a continuous interval list, and gives the details of the TM algorithm; Section IV gives an analysis of the compression efficiency of transaction mapping; Section V experimentally compares the TM algorithm with two popular algorithms, FP-growth and dEclat; Section VI provides some general comments; Section VII concludes the paper.

II. BASIC PRINCIPLES

A. Association Rules Mining

Let I = {i1, i2, ..., im} be a set of items, and let D be a database having a set of transactions, where each transaction T is a subset of I. An association rule is an association relationship of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The support of rule X ⇒ Y is defined as the percentage of transactions containing both X and Y in D. The confidence of X ⇒ Y is defined as the percentage of transactions containing X that also contain Y in D. The task of association rules mining is to find all strong association rules that satisfy a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf). Mining association rules consists of two phases. In the first phase, all frequent itemsets that satisfy min_sup are found. In the second phase, strong association rules are generated from the frequent itemsets found in the first phase. Most research considers only the first phase because, once the frequent itemsets are found, mining association rules is trivial.

B. Data Representation

Two types of database layouts are employed in association rules mining: horizontal and vertical. In the traditional horizontal database layout, each transaction consists of a set of items, and the database contains a set of transactions. Most Apriori-like algorithms use this type of layout. In the vertical database layout, each item maintains a set of ids (denoted by tidset) of the transactions in which it is contained. This layout could also be maintained as a bitvector. Eclat uses tidsets, while VIPER and Mafia use compressed bitvectors. It has been shown that the vertical layout generally performs better than the horizontal format [8], [9]. Tables I through III show examples of the different types of layouts.

TABLE I
HORIZONTAL REPRESENTATION

tid   items
1     2, 1, 5, 3
2     2, 3
3     1, 4
4     3, 1, 5
5     2, 1, 3
6     2, 4

TABLE II
VERTICAL TIDSET REPRESENTATION

item  tidset
1     1, 3, 4, 5
2     1, 2, 5, 6
3     1, 2, 4, 5
4     3
5     1, 4
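The two layouts are related by a simple inversion; the following sketch (ours, for illustration only) converts the horizontal database of Table I into the vertical tidset layout.

```python
from collections import defaultdict

# Horizontal layout from Table I: tid -> items
horizontal = {
    1: [2, 1, 5, 3],
    2: [2, 3],
    3: [1, 4],
    4: [3, 1, 5],
    5: [2, 1, 3],
    6: [2, 4],
}

# Invert to the vertical layout: item -> sorted list of tids containing it
vertical = defaultdict(list)
for tid in sorted(horizontal):
    for item in horizontal[tid]:
        vertical[item].append(tid)

print(dict(vertical))
# item 1 -> [1, 3, 4, 5], item 2 -> [1, 2, 5, 6], item 3 -> [1, 2, 4, 5]
```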

C. Lexicographic Prefix Tree

In this paper, we employ a lexicographic prefix tree data structure to efficiently generate candidate itemsets and count their frequencies. It is very similar to the lexicographic tree used in the TreeProjection algorithm [12]. This tree structure is also used in many other algorithms such as Eclat [8]. An example of this tree is shown in Fig. 1. Each node in the tree stores a collection of frequent itemsets together with the supports of these itemsets. The root contains all frequent 1-itemsets. Itemsets in level l (for any l) are frequent l-itemsets. Each edge in the tree is labeled with an item. Itemsets in any node are stored as singleton sets, with the understanding that the actual itemset also contains all the items found on the edges from this node to the root. For example, consider the leftmost node in level 2 of the tree in Fig. 1. There are four 2-itemsets in this node, namely {1,2}, {1,3}, {1,4}, and {1,5}. The singleton sets in each node of the tree are stored in lexicographic order. If the root contains {1}, {2}, ..., {n}, then the nodes in level 2 will contain {2}, {3}, ..., {n}; {3}, {4}, ..., {n}; ...; {n}; and so on. For each candidate itemset, we also store a list of transaction ids (i.e., the ids of the transactions in which all the items of the itemset occur). This tree will not be generated in full. The tree is generated in a depth-first order, and at any given time we only store the minimum information needed to continue the search. In particular, this means that at any instant at most a path of the tree will be stored. As the search progresses, if the expansion of a node cannot possibly lead to the discovery of itemsets that have minimum support, then the node will not be expanded and the search will backtrack. As a frequent itemset that meets the minimum support requirement is found, it is output. The candidate itemsets generated by the depth-first search are the same as those generated by the joining step (without pruning) of the Apriori algorithm.

TABLE III
VERTICAL BITVECTOR REPRESENTATION

item  bitvector
1     101110
2     110011
3     110110
4     001000
5     100100

Fig. 1. Illustration of lexicographic tree

III. TM ALGORITHM

Our contribution is that we compress the tids (transaction ids) of each itemset into continuous intervals by mapping transaction ids into a different space with the help of a transaction tree. Frequent itemsets are found by intersecting these interval lists instead of intersecting the transaction id lists (as in the Eclat algorithm). We begin with the construction of the transaction tree.

A. Transaction tree

The transaction tree is similar to the FP-tree except that there is no header table or node link. The transaction tree can be thought of as a compact representation of all the transactions in the database. Each node in the tree has an id corresponding to an item, and a counter that keeps the number of transactions that contain this item in this path. Adapted from [4], the construction of the transaction tree (called constructTransactionTree) is as follows:

1) Scan through the database once, identify all the frequent 1-itemsets, and sort them in descending order of frequency. At the beginning, the transaction tree consists of just a single node (a dummy root).

2) Scan through the database a second time. For each transaction, select the items that are in frequent 1-itemsets, sort them according to the order of the frequent 1-itemsets, and insert them into the transaction tree. When inserting an item, start from the root; at the beginning, the root is the current node.

In general, if the current node has a child node whose id is equal to this item, then just increment the count of this child by 1; otherwise, create a new child node and set its counter to 1.

Table IV and Fig. 2 illustrate the construction of a transaction tree. Table IV shows an example of a transaction database, and Fig. 2 displays the constructed transaction tree assuming a minimum support count of 2. The number before the colon in each node is the item id, and the number after the colon is the count of this item in this path.

TABLE IV
A SAMPLE TRANSACTION DATABASE

TID   Items             Ordered frequent items
1     2,1,5,3,19,20     1,2,3
2     2,6,3             2,3
3     1,7,8             1
4     3,1,9,10          1,3
5     2,1,11,3,17,18    1,2,3
6     2,4,12            2,4
7     1,13,14           1
8     2,15,4,16         2,4

Fig. 2. A transaction tree for the above database

B. Transaction mapping and the construction of interval lists

After the transaction tree is constructed, all the transactions that contain an item are represented by an interval list. Each interval corresponds to a contiguous sequence of relabeled ids. Each node in the transaction tree will be associated with an interval. The construction of interval lists for each item is done recursively, starting from the root, in a depth-first order. The process is described as follows. Consider a node u whose number of transactions is c and whose associated interval is [s, e]. Here s is the relabeled start id and e is the relabeled end id, with e − s + 1 = c. Assume that u has m children, with child i having c_i transactions, for i = 1, 2, ..., m. It is obvious that c_1 + c_2 + ... + c_m ≤ c. If the intervals associated with the children of u are [s_1, e_1], [s_2, e_2], ..., [s_m, e_m], these intervals are constructed as follows:

s_1 = s                                     (1)
e_1 = s_1 + c_1 − 1                         (2)
s_i = e_{i−1} + 1,   for i = 2, 3, ..., m   (3)
e_i = s_i + c_i − 1,  for i = 2, 3, ..., m  (4)

For the root, s = 1. For example, in Fig. 2, the root has two children. For the first child, s_1 = 1 and e_1 = 1 + 5 − 1 = 5, so the interval is [1,5]; for the second child, s_2 = 5 + 1 = 6 and e_2 = 6 + 3 − 1 = 8, so the interval is [6,8]. The compressed transaction id list of each item is ordered by the start id of each associated interval. In addition, if two intervals are contiguous, they are merged and replaced with a single interval. For example, the interval associated with each node is shown in Fig. 2; the two intervals of item 3, [1,2] and [3,3], will be merged to [1,3].

To illustrate the efficiency of this mapping process more clearly, assume that the eight transactions of the example database shown in Table IV repeat 100 times each. In this case the transaction tree becomes the one shown in Fig. 3. The mapped transaction interval lists for each item are shown in Table V, where [1,300] for item 3 results from the merging of [1,200] and [201,300].

TABLE V
EXAMPLE OF TRANSACTION MAPPING

Item  Mapped transaction interval list
1     [1,500]
2     [1,200], [501,800]
3     [1,300], [501,600]
4     [601,800]
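Equations (1)-(4) translate directly into a recursive relabeling pass. The following sketch is ours (the Node class and the tree reconstructed from Table IV are our assumptions, not the authors' code); it reproduces the intervals of Fig. 2, including the merge of item 3's intervals [1,2] and [3,3].

```python
from collections import defaultdict

class Node:
    def __init__(self, item, count, children=None):
        self.item = item          # item id (None for the dummy root)
        self.count = count        # number of transactions through this node
        self.children = children or []

def map_intervals(root, interval_lists):
    """Assign [s, e] to every node per equations (1)-(4), appending each
    interval to its item's list; contiguous intervals are merged."""
    def visit(node, s):
        e = s + node.count - 1                 # e - s + 1 = c
        lst = interval_lists[node.item]
        if lst and lst[-1][1] + 1 == s:        # contiguous: merge
            lst[-1] = (lst[-1][0], e)
        else:
            lst.append((s, e))
        child_start = s                        # s_1 = s        (eq. 1)
        for child in node.children:
            visit(child, child_start)          # e_i = s_i + c_i - 1
            child_start += child.count         # s_i = e_{i-1} + 1
    s = 1                                      # for the root, s = 1
    for child in root.children:
        visit(child, s)
        s += child.count

# Transaction tree reconstructed from Table IV (items ordered 1, 2, 3, 4):
root = Node(None, 8, [
    Node(1, 5, [Node(2, 2, [Node(3, 2)]), Node(3, 1)]),
    Node(2, 3, [Node(3, 1), Node(4, 2)]),
])
lists = defaultdict(list)
map_intervals(root, lists)
print(dict(lists))
# {1: [(1, 5)], 2: [(1, 2), (6, 8)], 3: [(1, 3), (6, 6)], 4: [(7, 8)]}
```

Scaling each transaction by 100 as in the text turns these intervals into exactly the lists of Table V.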

We now summarize a procedure (called mapTransactionIntervals) that computes the interval lists for each item. Using depth-first order, traverse the transaction tree. For each node, create an interval composed of a start id and an end id. If the node is the first child of its parent, then the start id of the interval is equal to the start id of the parent (equation (1)) and the end id is computed by equation (2). If not, the start id is computed by equation (3) and the end id by equation (4). Insert this interval into the interval list of the corresponding item. Once the interval lists for the frequent 1-itemsets are constructed, frequent i-itemsets (for any i) are found by intersecting interval lists along the lexicographic tree. Details are provided in the next subsection.

Fig. 3. Transaction tree for illustration

C. Interval lists intersection

In addition to the items described above, each element of a node in the lexicographic tree also stores a transaction interval list (corresponding to the itemset denoted by the element). By constructing the lexicographic tree in a depth-first order, the support count of a candidate itemset is computed by intersecting the interval lists of its two generating elements. For example, element 2 in the second level of the lexicographic tree in Fig. 1 represents the itemset {1,2}, whose support count is computed by intersecting the interval lists of itemset {1} and itemset {2}. In contrast, Eclat uses a tid list intersection; interval list intersection is more efficient. Note that since the intervals are constructed from the transaction tree, an interval cannot partially contain or be partially contained in another interval. There are only three possible relationships between any two intervals A = [s_1, e_1] and B = [s_2, e_2]:

1) A ∩ B = ∅. In this case, intervals A and B come from different paths of the transaction tree. For instance, intervals [1,500] and [501,800] in Table V.

2) A ⊇ B. In this case, interval A comes from an ancestor node of interval B in the transaction tree. For instance, intervals [1,500] and [1,300] in Table V.

3) A ⊆ B. In this case, interval A comes from a descendant node of interval B in the transaction tree. For instance, intervals [1,300] and [1,500] in Table V.

Considering the above three cases, the average number of comparisons for two intervals is 2.

D. Switching

After a certain level of the lexicographic tree, the transaction interval lists of the elements in any node can be expected to become scattered: there may be many transaction intervals that contain only single tids. At this point, the interval representation loses its advantage over the single tid representation, because the intersection of two intervals uses three comparisons in the worst case while the intersection of two single tids needs only one comparison. Therefore, we need to switch to the single tid representation at some point. Here, we define a coefficient of compression for a node in the lexicographic tree, denoted coeff, as follows. Assume that the node has m elements; let s_i represent the support of the ith element and l_i the size of the transaction list of the ith element. Then

coeff = (1/m) Σ_{i=1}^{m} (s_i / l_i)

For the intersection of two interval lists, the average number of comparisons is 2, so we switch to tid set intersection when coeff is less than 2.
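Because any two intervals are nested or disjoint (never partially overlapping), the intersection reduces to a merge-style scan. The following sketch of the intersection and of the coeff test is ours, not the paper's implementation.

```python
def intersect_intervals(list_a, list_b):
    """Intersect two sorted interval lists. Since the intervals come from
    one transaction tree, any overlapping pair is nested, so each overlap
    is simply the inner (shorter) interval."""
    out, i, j = [], 0, 0
    while i < len(list_a) and j < len(list_b):
        (s1, e1), (s2, e2) = list_a[i], list_b[j]
        s, e = max(s1, s2), min(e1, e2)
        if s <= e:                 # nested case: one contains the other
            out.append((s, e))
        if e1 <= e2:               # advance the list(s) that end first
            i += 1
        if e2 <= e1:
            j += 1
    return out

def coeff(elements):
    """Average of support / interval-list-length over a node's elements;
    switch to plain tid intersection when this drops below 2."""
    return sum(s / l for s, l in elements) / len(elements)

# Items 1 and 2 from Table V: support of {1,2} = |[1,200]| = 200
common = intersect_intervals([(1, 500)], [(1, 200), (501, 800)])
print(common, sum(e - s + 1 for s, e in common))  # [(1, 200)] 200
```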

E. Details of the TM Algorithm

Now we provide details of the steps involved in the TM algorithm. There are four steps:

1) Scan through the database and identify all frequent 1-itemsets.

2) Construct the transaction tree with counts for each node.

3) Construct the transaction interval lists. Merge intervals if they are mergeable (i.e., if the intervals are contiguous).

4) Construct the lexicographic tree in a depth-first order, keeping only the minimum amount of information necessary to complete the search. This in particular means that no more than a path of the lexicographic tree will ever be stored. While at any node, if further expansion of that node will not be fruitful, the search backtracks. When processing a node in the tree, for every element in the node the corresponding interval lists are computed by interval intersections. As the search progresses, itemsets with enough support are output. When the compression coefficient of a node becomes less than 2, switch to tid list intersection.

In the next section we provide an analysis to indicate how TM can provide computational efficiency.

IV. COMPRESSION AND TIME ANALYSIS OF TRANSACTION MAPPING

Suppose the transaction tree is fully filled in the worst case, as illustrated in Fig. 4, where the subscript of each C is the corresponding itemset and C represents the count for this itemset.

Fig. 4. Full transaction tree

Assume that there are n frequent 1-itemsets with supports S_1, S_2, ..., S_n, respectively. Then we have the following relationships:

S_1 = C_1 = |T_1|
S_2 = C_2 + C_{1,2} = |T_2| + |T_{1,2}|
S_3 = C_3 + C_{1,3} + C_{1,2,3} + C_{2,3} = |T_3| + |T_{1,3}| + |T_{1,2,3}| + |T_{2,3}|
...
S_n = C_n + C_{1,n} + C_{2,n} + ... + C_{n−1,n} + C_{1,2,n} + C_{1,3,n} + ...
    = |T_n| + |T_{1,n}| + |T_{2,n}| + ... + |T_{n−1,n}| + |T_{1,2,n}| + |T_{1,3,n}| + ...

Here each T represents the interval for a node, and |T| represents the length of T, which is equal to the corresponding count C. The maximum number of intervals possible for each frequent 1-itemset i is 2^{i−1}. The average compression ratio is therefore

Avgratio ≥ S_1 + S_2/2^1 + S_3/2^2 + ... + S_i/2^{i−1} + ... + S_n/2^{n−1}
         ≥ S_n (1 + 1/2^1 + 1/2^2 + ... + 1/2^{n−1})
         = 2 S_n (1 − 2^{−n})

When S_n, which is equal to min_sup, is high, the compression ratio will be large and thus the intersection time will be small. On the other hand, because the compression ratio for any itemset cannot be less than 1, assume that for frequent 1-itemset i the compression ratio is equal to 1, i.e., S_i / 2^{i−1} = 1. Then for all frequent 1-itemsets (in the first level of the lexicographic tree) whose id number is less than i the compression ratio is greater than 1, and for all frequent 1-itemsets whose id number is larger than i the compression ratio is equal to 1. Therefore, we have

Avgratio ≥ S_1 + S_2/2^1 + S_3/2^2 + ... + S_i/2^{i−1} + (n − i)
         ≥ 2 S_i (1 − 2^{−i}) + n − i
         = 2^i − 1 + n − i

Since 2^i > i, this bound is large when i is large, i.e., when fewer of the frequent 1-itemsets have compression ratio equal to 1; the transaction tree is then 'narrow'. In the worst case, when the transaction tree is fully filled, the compression ratio reaches its minimum value. Intuitively, when the size of the dataset is large and there are more repetitive patterns, the transaction tree will be narrow. In general, market data has these characteristics. In summary, when the minimum support is large, or the items are sparsely associated and there are more repetitive patterns (as in the case of market data), the algorithm runs faster.
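The geometric-series step in the bound above is easy to sanity-check numerically; the following small sketch is ours, with illustrative supports (not data from the experiments).

```python
# Check Avgratio >= S1 + S2/2 + ... + Sn/2^(n-1) >= 2*Sn*(1 - 2^(-n))
# for supports sorted in descending order, as in Section IV.
S = [100, 80, 60, 40, 20]            # S1 >= S2 >= ... >= Sn (illustrative)
n = len(S)
lower = sum(S[i] / 2**i for i in range(n))
bound = 2 * S[-1] * (1 - 2**-n)
print(lower, bound, lower >= bound)  # 161.25 38.75 True
```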

V. EXPERIMENTS AND PERFORMANCE EVALUATION

A. Comparison with dEclat and FP-growth

We used five data sets in our experiments. Three of them are synthetic (T10I4D100K, T25I10D10K, and T40I10D100K). These synthetic data resemble market basket data with short frequent patterns. The other two are real data sets (Mushroom and Connect-4) that are dense in long frequent patterns. These data sets have often been used in previous studies of association rules mining and were downloaded from http://fimi.cs.helsinki.fi/testdata.html and http://miles.cnuce.cnr.it/ palmeri/datam/DCI/datasets.php. Some characteristics of these data sets are shown in Table VI.

TABLE VI
CHARACTERISTICS OF EXPERIMENT DATA SETS

Data         #items  avg. trans. length  #transactions
T10I4D100k   1000    10                  100,000
T25I10D10K   1000    25                  9,219
T40I10D100k  942     39                  100,000
mushroom     120     23                  8,124
Connect-4    130     43                  67,557

We compared the TM algorithm mainly with two popular algorithms, dEclat and FP-growth, whose implementations were downloaded from http://www.cs.helsinki.fi/u/goethals/software; they were implemented by B. Goethals using the standard libraries and compiled in Visual C++. The TM algorithm was implemented based on these two codes. Small modifications were made to implement the transaction tree and interval list construction, interval list intersection, and switching. The same standard libraries were used to make the comparison fair. Implementations that employ other libraries and data structures might be faster than Goethals' implementation; comparing such implementations with the TM implementation would be unfair. The FP-growth code was modified slightly to read the whole database into memory at the beginning, so that the comparison of all three algorithms is fair. We did not compare with Eclat because it was shown in [11] that dEclat outperforms Eclat. Both TM and dEclat use the same optimization techniques, described below.

1) Early stopping
This technique was used earlier in Eclat [8]. The intersection of two tid sets can be stopped if the number of mismatches in one set is greater than the support of this set minus the minimum support threshold. For instance, assume that the minimum support threshold is 50 and the supports of the two itemsets AB and AC are 60 and 80, respectively. If the number of mismatches in AB has reached 11, then itemset ABC cannot be frequent. For interval list intersection, the number of mismatches is somewhat hard to record because of the complicated set relationships. Thus we have used the following rule: if the number of transactions not yet intersected is less than the minimum support threshold minus the number of matches, the intersection is stopped.

2) Dynamic ordering
Reordering all the items in every node at each level of the lexicographic tree in ascending order of support can reduce the number of generated candidate itemsets and hence the number of needed intersections. This property was first used by Bayardo [13].

3) Save intersection with combination
This technique comes from the following corollary [3]: if the support of the itemset X ∪ Y is equal to the support of X, then the support of the itemset X ∪ Y ∪ Z is equal to the support of the itemset X ∪ Z. For example, if the support of itemset {1,2} is equal to the support of {1}, then the support of itemset {1,2,3} is equal to the support of itemset {1,3}, so we do not need to conduct the intersection between {1,2} and {1,3}. Correspondingly, if the supports of several itemsets are all equal to the support of their common prefix itemset (subset) that is frequent, then any combination of these itemsets will be frequent. For example, if the supports of itemsets {1,2}, {1,3}, and {1,4} are all equal to the support of the frequent itemset {1}, then {1,2}, {1,3}, {1,4}, {1,2,3}, {1,2,4}, {1,3,4}, and {1,2,3,4} are all frequent itemsets. This optimization is similar to the single-path solution in the FP-growth algorithm.

All experiments were performed on a DELL 2.4GHz Pentium PC with 1G of memory, running Windows 2000. All times shown include the time for outputting all the frequent itemsets. The results are presented in Tables VII through XI and Figures 5 through 10.
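The early-stopping rule of optimization 1) can be sketched as follows (our illustration, in tid-list form; the bound test mirrors the rule stated above).

```python
def bounded_intersection_size(tids_a, tids_b, min_sup):
    """Intersect two sorted tid lists, abandoning the scan early once the
    candidate itemset can no longer reach min_sup (optimization 1)."""
    matches, i, j = 0, 0, 0
    while i < len(tids_a) and j < len(tids_b):
        # remaining tids of A plus matches so far bound the final support
        if matches + (len(tids_a) - i) < min_sup:
            return None                    # cannot reach min_sup: stop
        if tids_a[i] == tids_b[j]:
            matches += 1
            i += 1
            j += 1
        elif tids_a[i] < tids_b[j]:
            i += 1                         # mismatch in A
        else:
            j += 1                         # mismatch in B
    return matches if matches >= min_sup else None

print(bounded_intersection_size([1, 2, 4, 5], [2, 4, 5, 6], 3))  # -> 3
```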

Table VII shows the running times of the compared algorithms on the T10I4D100K data with different minimum supports, expressed as a percentage of the total number of transactions. Under large minimum supports, dEclat runs faster than FP-growth, while it runs slower than FP-growth under small minimum supports. The TM algorithm runs faster than both algorithms under almost all minimum support values. On average, the TM algorithm runs almost twice as fast as the faster of FP-growth and dEclat. Two graphs (Fig. 5 and Fig. 6) display the performance comparison under large and small minimum supports, respectively.

TABLE VII
RUN TIME (S) FOR T10I4D100K DATA

support(%)  FP-growth  dEclat   TM       dTM      MAFIA    FP*
5           0.671      0.39     0.328    0.5      0.625    0.156
2           2.375      1.734    0.984    1.687    1.796    0.484
1           5.812      5.562    2.406    5.656    6.375    0.89
0.5         7.359      9.078    4.515    9.421    18.359   1.187
0.2         7.484      11.796   7.359    12.75    24.671   1.64
0.1         8.5        12.875   8.906    14.796   33.234   1.828
0.05        11.359     15.656   10.453   19.859   56.031   2.078
0.02        20.609     33.468   14.421   63.187   146.437  2.64
0.01        33.781     73.093   21.671   168.906  396.453  3.937

Fig. 5. Run time for T10I4D100k data (1)
Fig. 6. Run time for T10I4D100k data (2)

Table VIII and Fig. 7 show the performance of the compared algorithms on the T25I10D10K data. dEclat runs, in general, faster than FP-growth, with exceptions at some minimum support values. The TM algorithm runs twice as fast as dEclat on average.

Fig. 7. Run time for T25I10D10K data

Table IX and Fig. 8 show the performance of the compared algorithms on the T40I10D100K data. The TM algorithm runs faster when the minimum support is larger, and slower when the minimum support is smaller.

Fig. 8. Run time for T40I10D100K data

TABLE VIII
RUN TIME (S) FOR T25I10D10K DATA

support(%)  FP-growth  dEclat   TM      dTM     MAFIA   FP*
5           0.25       0.14     0.093   0.171   0.140   0.046
2           3.093      3.203    1.109   3.937   2.359   0.25
1           4.406      4.921    2.718   5.859   4.015   0.437
0.5         5.187      5.296    3.953   6.578   5.828   0.64
0.2         10.328     6.937    5.656   10.968  17.406  1.14
0.1         31.219     20.953   10.906  51.484  54.078  2.125

TABLE IX
RUN TIME (S) FOR T40I10D100K DATA

support(%)  FP-growth  dEclat   TM       dTM      MAFIA    FP*
5           93.156     14.266   7.687    20.265   8.515    1.39
2           240.281    36.437   23.281   49.578   23.562   4.859
1           568.671    52.734   46.343   85.421   45.921   10.453
0.5         1531.92    121.718  178.078  260.328  262.937  23.031
0.2         4437.03    483.843  853.515  1374.86  1451.83  117.015

TABLE X
RUN TIME (S) FOR MUSHROOM DATA

support(%)  FP-growth  dEclat   TM       dTM      MAFIA    FP*
5           32.203     29.828   28.125   30.515   24.687   15.687
2           208.078    196.156  187.672  207.062  141.5    104.906
1           839.797    788.781  751.89   835.89   569.828  424.859
0.5         2822.11    2668.83  2640.83  2766.47  1938.25  1478.98

Table X and Fig. 9 compare the algorithms of interest on the Mushroom data. dEclat is better than FP-growth, and TM is better than dEclat.

Table XI and Fig. 10 show the relative performance of the algorithms on the Connect-4 data. The Connect-4 data is very dense, and hence the smallest minimum support in this experiment is 40 percent. Similar to the result on the Mushroom data, dEclat is faster than FP-growth and TM is faster than dEclat, though the difference is not significant.

B. Experiments with dTM

We have combined the TM algorithm with the dEclat algorithm in the following way: we represent the diffset [11] in dEclat between a candidate k-itemset and its prefix (k−1)-frequent itemset using mapped transaction intervals, and compute the support by subtracting the cardinality of the diffset from the support of the prefix (k−1)-frequent itemset. We name the corresponding algorithm the dTM algorithm. We ran the dTM algorithm on the five data sets; the run times are shown in Tables VII through XI. Unexpectedly, the performance of dTM is worse than that of TM. The reason is that the computation of the difference interval sets between two itemsets is more complicated than the computation of the intersection and has more overhead.

TABLE XI
RUN TIME (S) FOR CONNECT-4 DATA

support(%)  FP-growth  dEclat   TM       dTM      MAFIA    FP*
90          2.171      0.891    0.781    1.765    1.703    0.828
80          9.078      5.406    4.734    4.796    15.109   2.968
70          56.609     40.296   35.484   37.000   107.406  21.859
60          283.031    211.828  195.359  202.578  506.078  120.484
50          1204.67    935.109  871.359  928.218  2072.73  525.968
40          4814.59    3870.64  3579.38  4013.76  7764.91  2229.06

Fig. 9. Run time for Mushroom data
Fig. 10. Run time for Connect-4 data

For instance, consider interval set1 = [s1, e1] and interval set2 = [s2, e2], [s3, e3], where both [s2, e2] and [s3, e3] are within [s1, e1]. The difference between [s3, e3] and [s1, e1] depends on the difference between [s2, e2] and [s1, e1]. So there are more cases to consider here than in the computation of the intersection of two sets.

C. Experiments with MAFIA and FP-growth*

In this experiment, we experimented with two other algorithms mentioned in the introduction, MAFIA and FP-growth*. The comparison, however, is just for reference, because the implementations of MAFIA and FP-growth* use different libraries and data structures, which makes the comparison unfair. The implementation of MAFIA was downloaded from http://himalaya-tools.sourceforge.net/Mafia/#download, and the implementation of FP-growth* was downloaded from http://www.cs.concordia.ca/db/dbdm/dm.html. The run times for these two algorithms are also shown in Tables VII through XI. TM is faster than MAFIA on four data sets, and slower than MAFIA only on the Mushroom data set. FP-growth* is the fastest among all the algorithms tested. The comparison, however, is unfair. For example, FP-tree construction should be slower than transaction tree construction, but in FP-growth* the implementation of FP-tree construction is faster than our implementation of transaction tree construction. For a minimum support of 0.5%, FP-growth* runs in 1.187s, while the construction of the transaction tree alone in the TM algorithm takes 1.281s. The run time difference between FP-growth and FP-growth* is not as large in the FP-growth* paper [7] as in this experiment ([7] uses a different implementation of FP-growth), which indicates that the implementation plays a great role.

VI. DISCUSSION

A. Overhead of constructing interval lists and interval comparison

One may be concerned that it takes extra effort to relabel transactions while constructing the interval lists. Fortunately, the transaction tree is constructed just once.

is done just once by traversing the transaction tree in depth-first order. The relabeling time is negligible compared to the intersection time. For example, for the Connect-4 data with a support of 0.5, the construction of the transaction tree takes 0.734s, constructing the interval lists takes less than 0.001s, and generating the frequent sets takes 870.609s. In the FP-growth algorithm, constructing the first FP-tree takes 2.844s, which is longer than the time to construct the transaction tree because of the building of the header table and node links. There is an overhead in interval comparison: the average number of interval comparisons is 2, according to the three possible relationships between two intervals, which is greater than the number of comparisons for id intersection (which is 1) used in the Eclat algorithm. During the first several levels, however, the interval compression ratio is bigger than 2. We therefore keep track of this compression ratio (coefficient), and when it becomes less than 2, we switch to single transaction ids as in the Eclat algorithm. Therefore, our algorithm combines the advantages of both FP-growth and Eclat. When the data can be compressed well by the transaction tree (one advantage of FP-growth is to use the FP-tree to compress the data), we use interval list intersection; when it cannot, we switch to id list intersection as in Eclat.

B. Run time

The data sets we have used in our experiments have often been used in previous research, and the times shown include the time needed in all the steps. Our algorithm outperforms FP-growth and dEclat. Actually, it is also much faster than Eclat. We did not show the comparison with Eclat because dEclat was claimed to outperform Eclat in [11]. We believe that our algorithm will also be faster than the Apriori algorithm. We did not compare TM and Apriori since FP-growth, Eclat and dEclat have all been shown to outperform Apriori [4], [8], [11].

C. Storage cost

The storage cost for maintaining the intervals of itemsets is less than that for maintaining id lists in the Eclat algorithm, because once an interval is generated, its corresponding node in the transaction tree is deleted. Once all the interval lists are generated, the transaction tree is removed, so we only need to store the interval lists. The storage is also less than that of the FP-tree (the FP-tree has a header table and node links). We use the lexicographic tree just to illustrate the DFS procedure, as in the Eclat algorithm. This tree is built on the fly and never built fully, so the lexicographic tree is not stored in full.

D. About comparisons

This paper focuses on algorithmic concepts rather than on implementations. For the same algorithm, the run time differs across implementations. We downloaded the dEclat and FP-growth implementations of Goethals and implemented our algorithm based on his code. The data structures (set, vector, multiset) and libraries (std) used are the same, and only the algorithmic parts differ. This makes the comparison fair. Although the implementations of MAFIA and FP-growth* used in this experiment are all in C/C++, the data structures and libraries used are different, which makes the comparison between the algorithms unfair. For example, FP-tree construction should be slower than transaction tree construction, but in FP-growth*, the implementation of FP-tree construction is faster than our implementation of transaction tree construction. Another example is that [7] uses a different implementation of FP-growth, so the run time difference between FP-growth and FP-growth* is not as large as in this experiment. For the TM algorithm we just modified Goethals' implementation of FP-tree construction and did not use faster implementations, because we want to make the comparison between TM and FP-growth fair. Our implementation, however, could be improved to make the run time faster. We feel that if we develop an implementation tailored for the TM algorithm instead of just modifying the downloaded code, TM will be competitive with FP-growth*.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we have presented a new algorithm, TM, based on the vertical database representation. The transaction ids of each itemset are transformed and compressed to continuous transaction interval lists in a different space using the transaction tree, and frequent itemsets are found by intersecting transaction intervals along a lexicographic tree in depth-first order. This compression greatly reduces the intersection time. Through experiments, the TM algorithm has been shown to gain significant performance improvement over FP-growth and dEclat

on datasets with short frequent patterns, and also some improvement on datasets with long frequent patterns. We have also performed the compression and time analysis of transaction mapping using the transaction tree and proved that transaction mapping can greatly compress the transaction ids into continuous transaction intervals, especially when the minimum support is high. Although FP-growth* is faster than TM in this experiment, the comparison is unfair. In our future work, we plan to improve the implementation of the TM algorithm and make a fair comparison with FP-growth*.

ACKNOWLEDGMENT

This work has been supported in part by the NSF Grants CCR-9912395 and ITR-0326155.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A.N. Swami, "Mining association rules between sets of items in large databases," Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM Press, Washington, D.C., pp. 207-216, May 1993.
[2] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," Proceedings of the 20th International Conference on Very Large Data Bases, Morgan Kaufmann, pp. 487-499, 1994.
[3] B. Goethals, "Survey on frequent pattern mining," Manuscript, 2003.
[4] J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM Press, Dallas, Texas, pp. 1-12, May 2000.
[5] J. Pei, J. Han, H. Lu, S. Nishio, S. Tang, and D. Yang, "H-Mine: Hyper-structure mining of frequent patterns in large databases," Proceedings of the IEEE International Conference on Data Mining, pp. 441-448, 2001.
[6] A. Pietracaprina and D. Zandolin, "Mining frequent itemsets using Patricia tries," FIMI '03, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, December 2003.
[7] G. Grahne and J. Zhu, "Efficiently using prefix-trees in mining frequent itemsets," FIMI '03, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, December 2003.
[8] M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, "New algorithms for fast discovery of association rules," Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, AAAI Press, pp. 283-286, 1997.
[9] P. Shenoy, J.R. Haritsa, S. Sudarshan, G. Bhalotia, M. Bawa, and D. Shah, "Turbo-charging vertical mining of large databases," Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM Press, Dallas, Texas, pp. 22-23, May 2000.
[10] D. Burdick, M. Calimlim, and J. Gehrke, "MAFIA: A maximal frequent itemset algorithm for transactional databases," Proceedings of the International Conference on Data Engineering, Heidelberg, Germany, pp. 443-452, April 2001.
[11] M.J. Zaki and K. Gouda, "Fast vertical mining using diffsets," Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, Washington, D.C., pp. 326-335, 2003.
[12] R. Agrawal, C. Aggarwal, and V. Prasad, "A tree projection algorithm for generation of frequent item sets," Parallel and Distributed Computing, pp. 350-371, 2000.
[13] R.J. Bayardo, "Efficiently mining long patterns from databases," Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM Press, Seattle, Washington, pp. 85-93, June 1998.

Mingjun Song received his first Ph.D. degree in remote sensing from the University of Connecticut. He is working at ADE Corporation as a software research engineer and is in his second Ph.D. program in Computer Science and Engineering at the University of Connecticut. His research interests include algorithms and complexity, data mining, pattern recognition, image processing, remote sensing, and geographical information systems.

Sanguthevar Rajasekaran is a Full Professor and UTC Chair Professor of Computer Science and Engineering (CSE) at the University of Connecticut. He is also the Director of the Booth Engineering Center for Advanced Technologies (BECAT) at UConn. He received his M.E. degree in Automation from the Indian Institute of Science (Bangalore) in 1983 and his Ph.D. degree in Computer Science from Harvard University in 1988. Before joining UConn, he served as a faculty member in the CISE Department of the University of Florida and in the CIS Department of the University of Pennsylvania. During 2000-2002 he was the Chief Scientist for Arcot Systems. His research interests include parallel algorithms, bioinformatics, data mining, randomized computing, computer simulations, and combinatorial optimization. He has published over 130 articles in journals and conferences, co-authored two texts on algorithms, and co-edited four books on algorithms and related topics. He is an elected member of the Connecticut Academy of Science and Engineering (CASE).
