
Comparing Dataset Characteristics that Favor the Apriori, Eclat or FP-Growth Frequent Itemset Mining Algorithms

Jeff Heaton
College of Engineering and Computing
Nova Southeastern University
Ft. Lauderdale, FL 33314
Email: [email protected]

Abstract—Frequent itemset mining is a popular data mining technique. Apriori, Eclat, and FP-Growth are among the most common algorithms for frequent itemset mining. Considerable research has been performed to compare the relative performance of these three algorithms by evaluating the scalability of each algorithm as the dataset size increases. While scalability as data size increases is important, previous papers have not examined the performance impact of similarly sized datasets that contain different itemset characteristics. This paper explores the effects that two dataset characteristics can have on the performance of these three frequent itemset algorithms. To perform this empirical analysis, a dataset generator is created to measure the effects of frequent itemset density and maximum transaction size on performance. The generated datasets all contain the same number of rows. This provides some insight into dataset characteristics that are conducive to each algorithm. The results of this paper's research demonstrate that Eclat and FP-Growth both handle increases in maximum transaction size and frequent itemset density considerably better than the Apriori algorithm.

I. Introduction

The research covered by this paper determines how the characteristics of a dataset might affect the performance of the Apriori, Eclat, and FP-Growth frequent itemset mining algorithms. These algorithms have several popular implementations[1], [2], [3]. The goal of this research is to determine the effects of basket size and frequent itemset density on the Apriori, Eclat, and FP-Growth algorithms. The research determined that these two dataset characteristics have a significant impact on the performance of the algorithms.

Most research into frequent itemset mining focuses on the performance differences between frequent itemset algorithms on a single dataset[4]. The effects of hyperparameters, such as minimum support, on the performance of frequent itemset mining algorithms have also been explored[5]. Some papers make use of common datasets from the UCI Machine Learning Repository[6]. Many papers make use of the IBM Quest Synthetic Data Generator[7] or some variant of it. Our paper makes use of a Python-based generator that is based on IBM's work[8].

This research evaluates the performance of the Apriori, Eclat and FP-Growth frequent itemset mining algorithms as implemented by Christian Borgelt in 2012[9]. Although association rule mining is a closely related technique, this research is limited to frequent itemset mining. By limiting the experimentation to a single implementation of frequent itemset mining, this research is able to evaluate how the characteristics of the dataset affect the performance of these algorithms.

II. Frequent Itemset Mining

Frequent itemset mining was introduced as a means to find frequent groupings of items in a database containing baskets/transactions of these items[10]. The database is composed of a series of baskets that are analogous to orders placed by customers. These orders are individual baskets that are made up of some number of items. Companies such as Amazon, Netflix, and other online retailers make use of frequent itemsets to suggest additional items that a consumer might want to purchase, based on their past purchasing history and the history of others with similar baskets[11]. The following data show baskets that might be used for frequent itemset mining, where each line represents a single basket of items:

[mp3-player usb-charger book-dct book-ths]
[mp3-player usb-charger]
[usb-charger mp3-player book-dct book-ths]
[usb-charger]
[book-dct book-ths]

From the above baskets several frequent itemsets can be defined. These are sets of items that frequently occur together, some of which are:

[mp3-player usb-charger]
[book-dct book-ths]
...

A simple visual analysis of the data shows that the items mp3-player and usb-charger frequently occur together. Likewise, book-dct and book-ths also frequently occur together. Frequent itemset algorithms make use of a variety of statistics to determine which itemsets to include.
III. Survey of Apriori, Eclat and FP-Growth

There are a variety of different algorithms that are used to mine frequent itemsets. First, a simple naive brute-force algorithm for building frequent itemsets will be evaluated. This paper then shows how Apriori, Eclat and FP-Growth address some of the shortcomings of the naive algorithm.

All four algorithms must calculate statistics about itemsets that might ultimately be included in the final collection of frequent itemsets. One statistic that is common to all four of these algorithms is support. The support of a candidate frequent itemset is the total count of how many of the database baskets support that candidate. A basket is said to cover a candidate itemset if the candidate is a subset of, or equal to, the basket. Support is sometimes expressed as the percentage of the total number of baskets in the database (N) that cover a candidate itemset (X). The following formula calculates the support percentage of a candidate itemset:

supp(X) = X_count / N    (1)

This equation can be applied to calculate the support for {mp3-player usb-charger} from the previously presented set of baskets:

supp({mp3-player usb-charger}) = 3 / 5 = 0.6    (2)

The support statistic of 0.6 indicates that 60% of the five baskets contain the candidate itemset {mp3-player usb-charger}. Most frequent itemset algorithms accept a minimum support parameter to filter out less common itemsets.
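The support calculation of Equation 1 is straightforward to express in code. The following is a short illustrative Python sketch (not part of this paper's benchmark code) that computes the support of a candidate itemset against the example baskets from Section II:

baskets = [
    {"mp3-player", "usb-charger", "book-dct", "book-ths"},
    {"mp3-player", "usb-charger"},
    {"usb-charger", "mp3-player", "book-dct", "book-ths"},
    {"usb-charger"},
    {"book-dct", "book-ths"},
]

def supp(candidate, baskets):
    # A basket covers the candidate when the candidate is a subset of it.
    covered = sum(1 for basket in baskets if candidate <= basket)
    return covered / len(baskets)

print(supp({"mp3-player", "usb-charger"}, baskets))  # prints 0.6, as in Equation 2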
IV. Naive Algorithm for Finding Frequent Itemsets

It is not difficult to extract frequent itemsets from basket data. It is, however, difficult to do so efficiently. For the algorithms presented here, let J represent a set of items; likewise, let D represent a database of baskets that are made up of those items. Algorithm 1 is a summarization of the naive frequent itemset algorithm provided by Garcia-Molina, Ullman, and Widom[12].

Algorithm 1 Naive Frequent Itemset Algorithm
1: INPUT: A file D consisting of baskets of items.
2: OUTPUT: The sets of itemsets F_1, F_2, ..., F_q, where F_i is the set of all itemsets of size i that appear in at least s baskets of D.
3: METHOD:
4: R ← integer array, all item combinations in D, of size 2^|D|
5: for n ← 1 to |D| do
6:   F ← all possible set combinations from D_n
7:   Increase each value in R[] corresponding to each itemset in F
8: return all itemsets with R[] ≥ s

The naive algorithm simply generates all possible itemsets, counts their support, and then discards all itemsets below some threshold level of support. The constant S or σ typically represents the support threshold. Computing all possible itemsets is only an O(N) magnitude operation in all cases, where N is the number of baskets in the database. However, the naive algorithm would also need 2^i memory cells to store all of these itemsets as the counts are generated, where i is the number of individual items. These memory cells would typically be 32 or 64-bit integers. This memory requirement means that the naive algorithm is impractical for anything but a trivial number of individual items. A computer with 128GB of available RAM would theoretically only be able to handle 34 items when using a 64-bit integer to hold the counts. When one considers that the number of individual items might be the total count of distinct items for sale by a retailer such as Walmart or Amazon, it is obvious that the naive approach is not useful in practice.

V. Naive Algorithm Example

This section demonstrates how the naive algorithm would handle the example basket set given earlier in this paper. The total number of items contained in the database, |J|, is equal to four. Four items can be arranged a total of 2^|J|, or 16, different ways. However, because one of these itemsets is the empty set, only the following 15 candidate itemsets are considered:

[book-ths]
[book-dct]
[book-dct, book-ths]
[usb-charger]
[usb-charger, book-ths]
[usb-charger, book-dct]
[usb-charger, book-dct, book-ths]
[mp3-player]
[mp3-player, book-ths]
[mp3-player, book-dct]
[mp3-player, book-dct, book-ths]
[mp3-player, usb-charger]
[mp3-player, usb-charger, book-ths]
[mp3-player, usb-charger, book-dct]
[mp3-player, usb-charger, book-dct, book-ths]

The above itemsets are considered candidate frequent itemsets because it has not yet been determined whether all of these candidates will be included in the final list of frequent itemsets. Once the candidate itemsets have been determined, the naive algorithm will pass over all baskets and count the support for each of the candidate itemsets. Candidate itemsets that are below the required support S will be purged. The naive algorithm would calculate support for each candidate as follows:

[book-ths]; s = 3
[book-dct]; s = 3
[book-dct, book-ths]; s = 3
[usb-charger]; s = 4
[usb-charger, book-ths]; s = 2
[usb-charger, book-dct]; s = 2
[usb-charger, book-dct, book-ths]; s = 2
[mp3-player]; s = 3
[mp3-player, book-ths]; s = 2
[mp3-player, book-dct]; s = 2
[mp3-player, book-dct, book-ths]; s = 2
[mp3-player, usb-charger]; s = 3
[mp3-player, usb-charger, book-ths]; s = 2
[mp3-player, usb-charger, book-dct]; s = 2
[mp3-player, usb-charger, book-dct, book-ths]; s = 2

It is necessary to store a count for every possible itemset when using the naive algorithm. Of course, once the support counts are determined, many of the candidate itemsets will be purged. Nevertheless, the fact that these values must be kept while the database is scanned for support means the naive algorithm requires considerable memory.
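To make the naive procedure concrete, the following Python sketch enumerates every non-empty subset of each basket and tallies supports. For readability it replaces the 2^|D| integer array of Algorithm 1 with a dictionary keyed by itemset, but the exhaustive enumeration is the same; run with s = 3 against the example baskets, it returns exactly the candidates listed above whose support is 3 or more.

from itertools import combinations
from collections import Counter

def naive_frequent_itemsets(baskets, s):
    counts = Counter()
    for basket in baskets:
        # Enumerate and count every non-empty subset of the basket.
        for k in range(1, len(basket) + 1):
            for itemset in combinations(sorted(basket), k):
                counts[itemset] += 1
    # Purge candidates below the required support s.
    return {itemset: n for itemset, n in counts.items() if n >= s}

The exponential subset enumeration is what makes this approach unusable beyond a trivial number of items.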
VI. Apriori Algorithm for Finding Frequent Itemsets

Agrawal and Srikant initially introduced the Apriori algorithm to provide performance improvements over a naive itemset search[13]. The Apriori algorithm has been around almost as long as the concept of frequent itemsets and is very popular. While the naive algorithm is a theoretical concept that is not used in practice, Apriori has become the classic implementation of frequent itemset mining. Apriori, as defined by Goethals (2003), is presented as Algorithm 2[14].

Algorithm 2 Apriori Frequent Itemset Algorithm
1: INPUT: A file D consisting of baskets of items, a support threshold σ.
2: OUTPUT: A list of itemsets F(D, σ).
3: METHOD:
4: C_1 ← {{i} | i ∈ J}
5: k ← 1
6: while C_k ≠ {} do
7:   # Compute the supports of all candidate itemsets
8:   for all transactions (tid, I) ∈ D do
9:     for all candidate itemsets X ∈ C_k do
10:      if X ⊆ I then
11:        X.support++
12:  # Extract all frequent itemsets
13:  F_k = {X | X.support > σ}
14:  # Generate new candidate itemsets
15:  for all X, Y ∈ F_k, X[i] = Y[i] for 1 ≤ i ≤ k − 1, and X[k] < Y[k] do
16:    I = X ∪ {Y[k]}
17:    if ∀J ⊂ I, |J| = k : J ∈ F_k then
18:      C_{k+1} ← C_{k+1} ∪ I
19:  k++

Apriori is based on the hierarchical monotonicity of frequent itemsets between their supersets and subsets. As implied by monotonicity, a subset of a frequent itemset must also be frequent. Likewise, a superset of an infrequent itemset must also be infrequent[13]. This allows the Apriori algorithm to be implemented as a breadth-first search. Papers by Goethals (2003) and others do not represent Apriori's performance in terms of big-O notation[14]. This is likely due to the fact that Apriori's outer loops are bounded by the number of common prefixes and not some easily determined constant such as the number of items or the length of the dataset. Papers describing Apriori, Eclat, and FP-Growth rely on empirical comparison of the algorithms rather than big-O analysis. However, analysis covered later in this paper does allow these three algorithms to be expressed in big-O terms based on average basket size and frequent itemset density.

Apriori first builds a list of all singleton itemsets with sufficient support. Building on the monotonicity principle, the next set of candidate frequent itemsets is built from combinations of the singleton itemsets. This process continues until the maximum length specified for frequent itemsets is reached. The evaluations performed by this research did not impose this maximum length.

The primary issue with the Apriori algorithm is that it must perform a scan of the database at every level of the breadth-first search. Additionally, candidate generation can lead to a great number of subsets and can become a significant memory requirement. Deficiencies in the Apriori algorithm led to the development of other, more efficient algorithms, such as Eclat and FP-Growth.

VII. Apriori Algorithm Example

This section demonstrates how the Apriori algorithm handles the basket set given earlier in this paper. The Apriori algorithm performs a breadth-first search of the itemsets. Figure 1 shows a segment of this search, for the items usb-cable, mp3-player, and book-dct.

Fig. 1. Apriori Breadth-First Search

The candidate set starts empty, and the algorithm begins by adding all singleton itemsets that have sufficient individual support. For simplicity, it is assumed that only usb-cable, mp3-player, and book-dct have sufficient support. The next layer is built of combinations of the previous layer that had sufficient support. For simplicity, it is also assumed that all three combinations had sufficient support. Finally, a triplet itemset with all three items is evaluated.
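The join-and-prune step of Algorithm 2 (lines 15-18) is the heart of Apriori. The sketch below isolates that step, assuming frequent k-itemsets are stored as sorted tuples; applied to the three pairs from Figure 1, it yields the single triplet candidate.

from itertools import combinations

def generate_candidates(frequent_k):
    # frequent_k: set of frequent k-itemsets, each stored as a sorted tuple.
    candidates = set()
    for x in frequent_k:
        for y in frequent_k:
            # Join step: equal (k-1)-prefixes, with x's last item before y's.
            if x[:-1] == y[:-1] and x[-1] < y[-1]:
                candidate = x + (y[-1],)
                # Prune step: every k-subset must itself be frequent.
                if all(sub in frequent_k
                       for sub in combinations(candidate, len(candidate) - 1)):
                    candidates.add(candidate)
    return candidates

pairs = {("book-dct", "mp3-player"), ("book-dct", "usb-cable"),
         ("mp3-player", "usb-cable")}
print(generate_candidates(pairs))
# {('book-dct', 'mp3-player', 'usb-cable')}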
VIII. Eclat Algorithm for Finding Frequent Itemsets

Eclat was introduced by Zaki, Parthasarathy, Ogihara, and Li in 1997[15]. Eclat is an acronym for Equivalence Class Clustering and bottom-up Lattice Traversal. The primary difference between Eclat and Apriori is that Eclat abandons Apriori's breadth-first search for a recursive depth-first search. Eclat, as defined by Goethals (2003), is presented as Algorithm 3[14].

Algorithm 3 Eclat Frequent Itemset Algorithm
1: INPUT: A file D consisting of baskets of items, a support threshold σ, and an item prefix I, such that I ⊆ J.
2: OUTPUT: A list of itemsets F[I](D, σ) for the specified prefix.
3: METHOD:
4: F[I] ← {}
5: for all i ∈ J occurring in D do
6:   F[I] := F[I] ∪ {I ∪ {i}}
7:   # Create D_i
8:   D_i ← {}
9:   for all j ∈ J occurring in D such that j > i do
10:    C ← cover({i}) ∩ cover({j})
11:    if |C| ≥ σ then
12:      D_i ← D_i ∪ {(j, C)}
13:  # Depth-first recursion
14:  Compute F[I ∪ {i}](D_i, σ)
15:  F[I] := F[I] ∪ F[I ∪ {i}]

The input parameters to Eclat are slightly different than Apriori's in that a prefix I is provided. This prefix specifies the pattern that must be present in any itemsets found by the call to Eclat. This change allows a depth-wise recursive building of the itemsets. The initial call to Eclat uses an I value of {}, meaning that no specific prefix is required. This initial call finds all single-item frequent itemsets. The Eclat algorithm then recursively calls itself, each time extending I with itemsets that contain the value of I that the function was called with, but are one item longer. This process continues until the value of I has grown to sufficient length that the algorithm has traversed baskets of all lengths. Like Apriori, Eclat is not usually expressed in big-O terms; however, results obtained from this research's experimentation show how frequent itemset density and basket size allow these algorithms to be expressed in terms of big-O computational cost.

There are several different methods for storing the support values in the recursive Eclat algorithm. The most common approach is to use a structure called a trie. This is the approach used by Borgelt (2012) to implement the versions of Apriori, Eclat and FP-Growth investigated in this research paper[9]. A trie graph always contains an empty root node. As itemsets are encountered, they are added to the trie by inserting a node for each item that makes up the itemset. The left-most item corresponds to a child of the root node. The second item corresponds to a child of the first item of this frequent set. No parent would ever have more than one child of the same item name; however, an item name may appear at multiple locations in the trie.

The trie is generated so that the algorithm can quickly find the support of an itemset by traversing the trie as the items in the set are read left-to-right. The node that contains the right-most item contains the support for that itemset. As the algorithm processes the database, the trie is traversed looking for each itemset discovered. Nodes are created, if necessary, to fill out the trie to hold all itemsets. If the nodes already exist, the node for the right-most item in the itemset has its support increased. New nodes start with a support of 1. This allows Eclat to use less memory than Apriori, because the core branches of the trie allow heavily used subsets to be stored only once. Theoretically, a trie could be used with Apriori; however, the breadth-first nature of Apriori would typically require too much memory.

IX. Eclat Algorithm Example

This section demonstrates how the Eclat algorithm would handle the basket set given earlier in this paper. A portion of the trie built by Eclat is shown in Figure 2.

Fig. 2. A Trie Used by Eclat

The trie portion shown in Figure 2 encodes a total of 7 different frequent itemsets' support values. To find a particular frequent itemset's support, simply traverse the graph, following the items from left to right. The frequent itemset {mp3-player, usb-charger} would have a support value of 3. Similarly, the frequent itemset {mp3-player, usb-charger, book-dct} would have a support value of 2. Once the algorithm completes, the trie is traversed, and all frequent itemsets are extracted from it.
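The cover intersection of Algorithm 3 can be illustrated directly with Python sets, where the cover of an item is the set of transaction ids that contain it. This is a simplified tidset-based sketch of the recursion, not Borgelt's trie-based implementation; the covers below come from the five example baskets, numbered 0 through 4.

def eclat(prefix, items, sigma, result):
    # items: list of (item, cover) pairs; cover = set of transaction ids.
    for i, (item, cover) in enumerate(items):
        if len(cover) >= sigma:
            itemset = prefix + [item]
            result[tuple(itemset)] = len(cover)
            # Build the conditional database D_i from items ordered after i.
            suffix = []
            for other, other_cover in items[i + 1:]:
                c = cover & other_cover  # cover({i}) ∩ cover({j})
                if len(c) >= sigma:
                    suffix.append((other, c))
            # Depth-first recursion with the extended prefix.
            eclat(itemset, suffix, sigma, result)

covers = {"mp3-player": {0, 1, 2}, "usb-charger": {0, 1, 2, 3},
          "book-dct": {0, 2, 4}, "book-ths": {0, 2, 4}}
result = {}
eclat([], sorted(covers.items()), 2, result)
print(result[("mp3-player", "usb-charger")])  # 3, matching Figure 2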
X. FP-Growth Algorithm for Finding Frequent Itemsets

Frequent pattern growth (FP-Growth) was introduced by Han, Pei, and Yin in 2000 to forego candidate generation altogether[16]. This is done by using a trie to store the actual baskets, rather than storing candidates as Apriori and Eclat do. Apriori is very much a horizontal, breadth-first algorithm. Similarly, Eclat is very much a vertical, depth-first algorithm. The trie structure of FP-Growth provides a vertical view of the data. However, FP-Growth also adds a header table for every individual item that has support above the threshold support level. This header table contains a linked list through the trie that connects every node of the same item. The header table gives FP-Growth a horizontal view of the data, in addition to the vertical view provided by the trie.

The FP-Growth algorithm is similar to Eclat in that it has not been expressed in terms of big-O analysis. FP-Growth, as defined by Goethals (2003), is presented as Algorithm 4[14].

Algorithm 4 FP-Growth Frequent Itemset Algorithm
1: INPUT: A file D consisting of baskets of items, a support threshold σ, and an item prefix I, such that I ⊆ J.
2: OUTPUT: A list of itemsets F[I](D, σ) for the specified prefix.
3: F[I] ← {}
4: for all i ∈ J occurring in D do
5:   F[I] ← F[I] ∪ {I ∪ {i}}
6:   # Create D_i
7:   D_i ← {}
8:   H ← {}
9:   for all j ∈ J occurring in D such that j > i do
10:    if support(I ∪ {i, j}) ≥ σ then
11:      H ← H ∪ {j}
12:  for all (tid, X) ∈ D with i ∈ X do
13:    D_i ← D_i ∪ {(tid, X ∩ H)}
14:  # Depth-first recursion
15:  Compute F[I ∪ {i}](D_i, σ)
16:  F[I] ← F[I] ∪ F[I ∪ {i}]

XI. FP-Growth Algorithm Example

This section demonstrates how the FP-Growth algorithm would handle the basket set given earlier in this paper. Figure 3 shows a portion of the FP-Growth trie and the header table generated for the earlier example data.

Fig. 3. FP-Growth Trie and Header Table

This figure demonstrates the horizontal and vertical nature of the FP-Growth algorithm. The trie, on the right, holds the encoded baskets, along with their supports. The header, on the left, holds the items and provides horizontal access to the data.
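A compact sketch of the structure just described follows. It builds the trie and threads each new node onto a per-item linked list that serves as the header table. Items are assumed to be pre-filtered and ranked by descending support, as FP-Growth requires, and the mining recursion itself is omitted; this is an illustration of the data structure, not Borgelt's implementation.

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}
        self.link = None  # next node for the same item in the header chain

def build_fp_tree(baskets, rank, header):
    # rank: item -> position in descending-support order.
    # header: item -> first node of that item's linked list.
    root = FPNode(None, None)
    for basket in baskets:
        node = root
        # Keep only ranked (frequent) items, most frequent first.
        for item in sorted((i for i in basket if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                # Thread the new node onto the header table's linked list.
                child.link = header.get(item)
                header[item] = child
            else:
                child.count += 1
            node = child
    return root

Run over the example baskets, with rank assigning lower numbers to higher-support items, this produces a trie and header table analogous to the one shown in Figure 3.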
XII. Empirical Comparison of Apriori, Eclat and FP-Growth

There are a number of papers that compare the computational performance of Apriori, Eclat and FP-Growth. These papers are typically focused primarily on comparing the differences between the algorithms on one or more datasets and at different support thresholds. Papers by Borgelt (2012) and Goethals (2003) are examples of papers that compare various implementations of Apriori, Eclat and FP-Growth[9], [14].

This paper attempts a different approach. The goal of this paper is to see what effect the dataset has on the algorithm. The average basket size and frequent itemset density are used as independent variables to evaluate total processing time as the dependent variable. Apriori, Eclat and FP-Growth are each evaluated independently to see which performs best at different basket sizes and frequent itemset densities. Better performance is measured as a shorter total runtime. This paper focuses on a single implementation of these algorithms, using the implementations of Apriori, Eclat and FP-Growth published by Borgelt in 2012[9].
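As a sketch of how such runtime measurements can be collected, the harness below times one mining run per algorithm with Python's subprocess module. It assumes Borgelt's command-line miners (apriori, eclat, fpgrowth) are on the PATH and accept an input file, an output file, and a -s minimum-support option; the exact flags should be verified against the installed version, and this wall-clock approach is an assumption about the methodology, not a description of the paper's exact harness.

import subprocess
import time

def time_miner(binary, infile, support, outfile="itemsets.out"):
    # Time a single frequent itemset mining run end to end.
    start = time.perf_counter()
    subprocess.run([binary, f"-s{support}", infile, outfile], check=True)
    return time.perf_counter() - start

for algo in ("apriori", "eclat", "fpgrowth"):
    print(algo, time_miner(algo, "baskets.txt", 0.1), "seconds")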
XIII. Dataset Generation

Generated datasets are used to perform this evaluation. This generated data allows the two independent variables to be adjusted to create a total of 20 different datasets for the evaluations. The datasets were generated using a simple Python script created for this paper that can be found on GitHub[8]. This Python script accepts the following parameters to specify the dataset to produce:

• Transaction/basket count: 10 million default
• Number of items: 50,000 default
• Number of frequent sets: 100 default
• Max transaction/basket size: independent variable, 5-100 range
• Frequent set density: independent variable, 0.1 to 0.8 range

While basket count, number of frequent sets, and number of items can easily be varied in the script, for the purposes of this paper they remain fixed at the above values. Through informal experimentation it was determined that the basket count had only a small positive correlation with processing time. The number of items did not appear to have a meaningful impact on processing time when varied in isolation. It was observed that the strongest correlation with processing time was through variation of the maximum basket size and frequent set density.

The following listing shows the type of data generated for this research. Here an example file was created with 10 baskets, drawn from 100 items, with 2 frequent itemsets, a maximum basket size of 10, and a density of 0.5.

I36 I94
I71 I13 I91 I89 I34
F6 F5 F3 F4
I86
I39 I16 I49 I62 I31 I54 I91
I22 I31
I70 I85 I78 I63
F4 F3 F1 F6 F0 I69 I44
I82 I50 I9 I31 I57 I20
F4 F3 F1 F6 F0 I87

As can be seen from the above file, the items are prefixed with either "I" or "F". The "F" prefix indicates that the line contains one of the intentional frequent itemsets. Items with the "I" prefix are not part of an intentional frequent itemset. Of course, "I" prefixed items might still form frequent itemsets, as they are uniformly sampled from the set of items to fill out non-frequent itemsets. Each basket has a random size chosen, up to the maximum basket size. The frequent itemset density specifies the probability of each line containing one of the intentional frequent itemsets. Because a density of 0.5 was used, approximately half of the lines above contain one of the two intentional frequent itemsets. A frequent itemset line may have additional random "I" prefixed items added to cause the line to reach the randomly chosen length for that line. If the chosen frequent itemset causes the generated line to exceed its randomly chosen length, no truncation will occur. The intentional frequent itemsets are all chosen to be less than or equal to the maximum basket size.
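The actual generator script is available on GitHub[8]. The following is a simplified sketch of the sampling logic just described; the function signature and names are illustrative rather than the script's real interface.

import random

def generate_baskets(num_baskets, num_items, frequent_sets, max_size, density):
    # frequent_sets: the intentional frequent itemsets, e.g.
    # [["F6", "F5", "F3", "F4"], ["F4", "F3", "F1", "F6", "F0"]].
    lines = []
    for _ in range(num_baskets):
        size = random.randint(1, max_size)  # random basket size, up to the max
        basket = []
        if random.random() < density:
            # Seed the basket with one intentional frequent itemset;
            # if it exceeds the chosen size, it is not truncated.
            basket.extend(random.choice(frequent_sets))
        # Pad with uniformly sampled "I" items to reach the chosen size.
        while len(basket) < size:
            item = "I%d" % random.randrange(num_items)
            if item not in basket:
                basket.append(item)
        lines.append(" ".join(basket))
    return lines

Calling generate_baskets(10, 100, two_itemsets, 10, 0.5) with two intentional itemsets yields output of the same form as the listing above.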
XIV. Effects of Dataset Density

Dataset density specifies the percentage of baskets that intentionally contain frequent itemsets. As frequent itemset density increases, so does the processing time of Apriori, Eclat and FP-Growth, as shown by Figure 4.

Fig. 4. Frequent Itemset Density's Effect on Runtime (seconds)

This chart shows the results of running 10 million baskets, with an average basket size of 50, at various densities. The Eclat and FP-Growth algorithms both show very similar growth as frequent itemset density increases. The Apriori algorithm also performs very similarly to Eclat and FP-Growth until the density surpasses 70%. As previously mentioned in this paper, Apriori has considerably larger memory needs than the other algorithms. At 70% density Apriori had allocated all of the test machine's 8 gigabytes of RAM. This made swapping to physical storage necessary and had a drastic impact on the algorithm's runtime. It is also interesting to note that Eclat is marginally ahead of FP-Growth at low densities; this ranking reverses at higher densities. Between 10% and 70% all three algorithms exhibit approximately O(N log N) complexity. Beyond 70%, Apriori approached O(N^2) and worse complexity, where N is the number of actual frequent items in the database.

XV. Effects of Basket Size

Basket size specifies the maximum number of items per basket line. Larger basket sizes mean that the frequent itemsets will also be larger. This increases the sizes of the data structures used to hold these itemsets. These larger data structures require more memory for storage and greater processing time to traverse. Figure 5 illustrates the effect of increasing transaction sizes on the performance of the three algorithms.

Fig. 5. Maximum Basket Size's Effect on Runtime (seconds)

This chart shows the results of running 10 million baskets, with a frequent itemset density of 50%, at various maximum basket sizes. The three algorithms show almost exactly the same O(N) performance for basket sizes up through 60. Above 60, Apriori grows much more quickly than the other two. This is possibly because of the increased memory used by Apriori. Interestingly, Apriori performed the best between maximum transaction sizes of 60 and 70. Further research is needed to determine why Apriori is superior in this small range.

XVI. Conclusions

Apriori is an easily understandable frequent itemset mining algorithm. Because of this, Apriori is a popular starting point for the study of frequent itemsets. However, Apriori has serious scalability issues and exhausts available memory much faster than Eclat and FP-Growth. Because of this, Apriori should not be used for large datasets.

Most frequent itemset applications should consider using either FP-Growth or Eclat. These two algorithms performed similarly in this paper's research, though FP-Growth did show slightly better performance than Eclat. Other papers also recommend FP-Growth for most cases[9]. Frequent itemset mining is an area of active research, and new algorithms, as well as modifications of existing algorithms, are often introduced. For an application where performance is critical, it is important to evaluate the dataset with newer algorithms as they are introduced and shown to have better performance than FP-Growth or Eclat.
R EFERENCES
[1] J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent pattern mining: Current
status and future directions,” Data Mining Knowledge Discovery, vol. 15,
no. 1, pp. 55–86, Aug. 2007.
[2] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H.
Witten, “The weka data mining software: an update,” ACM SIGKDD
explorations newsletter, vol. 11, no. 1, pp. 10–18, 2009.
[3] M. Hahsler, B. Gruen, and K. Hornik, “arules – A computational
environment for mining association rules and frequent item sets,”
Journal of Statistical Software, vol. 14, no. 15, pp. 1–25, October
2005. [Online]. Available: http://www.jstatsoft.org/v14/i15/
[4] D. Burdick, M. Calimlim, and J. Gehrke, “Mafia: A maximal frequent
itemset algorithm for transactional databases,” in Proceedings of the 17th
International Conference on Data Engineering, 2001. IEEE, 2001, pp.
443–452.
[5] Z. Zheng, R. Kohavi, and L. Mason, “Real world performance of asso-
ciation rule algorithms,” in Proceedings of the seventh ACM SIGKDD
international conference on Knowledge discovery and data mining.
ACM, 2001, pp. 401–406.
[6] A. Asuncion and D. Newman, "UCI machine learning repository," 2007. [Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html
[7] A. Pitman, "Market basket synthetic data generator," 2011, http://mloss.org/software/view/294/.
[8] J. Heaton, "Jeff Heaton's GitHub Repository - Conference/Paper Source Code," https://github.com/jeffheaton/papers, accessed: 2016-01-31.
[9] C. Borgelt, “Frequent item set mining,” Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery, vol. 2, no. 6, pp. 437–456,
2012.
[10] R. Agrawal, T. Imieliński, and A. Swami, “Mining association rules
between sets of items in large databases,” in ACM SIGMOD Record,
vol. 22, no. 2. ACM, 1993, pp. 207–216.
[11] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of massive
datasets. Cambridge University Press, 2014.
[12] H. Garcia-Molina, Database systems: the complete book. Pearson
Education India, 2008.
[13] R. Agrawal, R. Srikant et al., “Fast algorithms for mining association
rules,” in Proceedings of the 20th international conference of very large
data bases, VLDB, vol. 1215, 1994, pp. 487–499.
[14] B. Goethals, “Survey on frequent pattern mining,” University of Helsinki,
2003.
[15] M. J. Zaki, S. Parthasarathy, M. Ogihara, W. Li et al., “New algorithms
for fast discovery of association rules,” in KDD, vol. 97, 1997, pp. 283–
286.
[16] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate
generation,” in ACM SIGMOD Record, vol. 29, no. 2. ACM, 2000, pp.
1–12.
