Abstract | In this paper we propose two new parallel formulations of the Apriori algorithm that is used for computing association rules. These new formulations, IDD and HD, address the shortcomings of two previously proposed parallel formulations, CD and DD. Unlike the CD algorithm, the IDD algorithm partitions the candidate set intelligently among processors to efficiently parallelize the step of building the hash tree. The IDD algorithm also eliminates the redundant work inherent in DD, and requires substantially smaller communication overhead than DD. But IDD suffers from the added cost due to the communication of transactions among processors. HD is a hybrid algorithm that combines the advantages of CD and DD. Experimental results on a 128-processor Cray T3E show that HD scales just as well as the CD algorithm with respect to the number of transactions, and scales as well as IDD with respect to increasing candidate set size.
Keywords | Data mining, parallel processing, association rules, load balance, scalability.
I. Introduction
One of the important problems in data mining [1] is discovering association rules from databases of transactions, where each transaction contains a set of items. The most time consuming operation in this discovery process is the computation of the frequencies of the occurrence of subsets of items, also called candidates, in the database of transactions. Since usually such transaction-based databases contain a large number of distinct items, the total number of candidates is prohibitively large. Hence, current association rule discovery techniques [2], [3], [4], [5] try to prune the search space by requiring a minimum level of support for candidates under consideration. Support is a measure of the number of occurrences of the candidates in database transactions. Apriori [2] is a recent state-of-the-art algorithm that aggressively prunes the set of potential candidates of size k by using the following observation: a candidate of size k can meet the minimum level of support only if all of its subsets also meet the minimum level of support. In the kth iteration, this algorithm computes the occurrences of potential candidates of size k in each of the transactions. To do this task efficiently, the algorithm maintains all potential candidates of size k in a hash tree. This algorithm does not require the transactions to stay in main memory, but requires the hash trees to stay in main memory. If the entire hash tree cannot fit in the main memory, then the hash tree needs to be partitioned, and multiple passes over the transaction database need to be performed (one for each partition of the hash tree). Even with the highly effective pruning method of Apriori, the task of finding all association rules in many applications
can require a lot of computation power that is available only in parallel computers.
Two parallel formulations of the Apriori algorithm were proposed in [6]: Count Distribution (CD) and Data Distribution (DD).

II. Basic Concepts

Let T be the set of transactions, where each transaction is a subset of the item-set I. Let C be a subset of I; then the support of C is the number of transactions in T that contain C.
TABLE I

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk
F_1 = { frequent 1-item-sets };
for ( k = 2; F_{k-1} ≠ ∅; k++ ) {
    C_k = apriori_gen(F_{k-1});
    for all transactions t ∈ T {
        subset(C_k, t);
    }
    F_k = { c ∈ C_k | c.count ≥ minsup };
}
Answer = ⋃_k F_k;

Fig. 1. Apriori Algorithm
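As an illustration of the loop in Figure 1, the following is a minimal Python sketch (ours, not the implementation used in this paper): it replaces the candidate hash tree by a plain dictionary of counts and writes apriori_gen in its simplest join-and-prune form, with min_support given as an absolute count. With the transactions of Table I and a minimum support of 3 it returns the five frequent items together with {Coke, Milk} and {Diaper, Milk}.

from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Join frequent (k-1)-item-sets and keep only candidates whose
    (k-1)-subsets are all frequent (the Apriori pruning observation)."""
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(frozenset(s) in prev
                                       for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

def apriori(transactions, min_support):
    """Return all frequent item-sets (a dictionary stands in for the hash tree)."""
    transactions = [frozenset(t) for t in transactions]
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    answer = set(frequent)
    k = 2
    while frequent:
        candidates = apriori_gen(frequent, k)
        counts = {c: 0 for c in candidates}
        for t in transactions:                 # the subset() step of Figure 1
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c for c, n in counts.items() if n >= min_support}
        answer |= frequent
        k += 1
    return answer

table1 = [{"Bread", "Coke", "Milk"}, {"Beer", "Bread"},
          {"Beer", "Coke", "Diaper", "Milk"},
          {"Beer", "Bread", "Diaper", "Milk"}, {"Coke", "Diaper", "Milk"}]
print(apriori(table1, min_support=3))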
[Fig. 2 and Fig. 3: the candidate hash tree of 3-item-sets and the subset operation for the transaction {1 2 3 5 6}. The hash function maps items 1, 4, 7 to the left child of a node, items 2, 5, 8 to the middle child, and items 3, 6, 9 to the right child; the root-level expansions are 1 + {2 3 5 6}, 2 + {3 5 6}, and 3 + {5 6}.]
Since the items of each frequent item-set are kept in sorted order, each candidate item-set is generated in sorted order without any need for explicit sorting. Each candidate item-set is inserted into the hash tree by hashing each successive item at the internal nodes and then following the links in the hash table. Once a leaf is reached, the candidate item-set is inserted at the leaf if the total number of candidate item-sets at that leaf is less than the maximum allowed. If the total number of candidate item-sets at the leaf exceeds the maximum allowed and the depth of the leaf is less than k, the leaf node is converted into an internal node and child nodes are created for the new internal node. The candidate item-sets are then distributed to the child nodes according to the hash values of their items. For example, the candidate item-set {1 2 4} is inserted by hashing item 1 at the root to reach the left child node of the root, hashing item 2 at that node to reach its middle child node, and hashing item 4 to reach the left child node, which is a leaf node.
The subset function traverses the hash tree from the root with every item in a transaction as a possible starting item of a candidate. At the next level of the tree, all the items of the transaction following the starting item are hashed. This is done recursively until a leaf is reached. At that point, all the candidates at the leaf are checked against the transaction and their counts are updated accordingly. Figure 2 shows the subset operation at the first level of the tree with the transaction {1 2 3 5 6}. Item 1 is hashed to the left child node of the root, and the remaining transaction {2 3 5 6} is applied recursively to the left child node. Item 2 is hashed to the middle child node of the root, and the whole transaction is checked against the two candidate item-sets in the middle child node. Then item 3 is hashed to the right child node of the root, and the remaining transaction {5 6} is applied recursively to the right child node. Figure 3 shows the subset operation on the left child node of the root. Here items 2 and 5 are hashed to the middle child node, and the remaining transactions {3 5 6} and {6}, respectively, are applied recursively to the middle child node. Item 3 is hashed to the right child node, and the remaining transaction {5 6} is applied recursively to the right child node.
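The two operations can be sketched in Python as follows. This is our illustration rather than the paper's data structure: the hash function is simply item mod 3, at most three candidates are kept per leaf, candidates and transactions are sorted tuples, and, as described above, a leaf that is reached more than once for the same transaction is checked only once.

from collections import defaultdict

FANOUT = 3      # assumed hash function: bucket = item % FANOUT
MAX_LEAF = 3    # assumed maximum number of candidates per leaf before it splits

class Node:
    def __init__(self):
        self.children = None   # bucket -> Node when internal, None when leaf
        self.leaf = []         # candidate item-sets (sorted tuples) when leaf

def insert(node, itemset, k, depth=0):
    """Insert a sorted candidate k-item-set; split a full leaf whose depth < k."""
    if node.children is not None:                     # internal node: hash the next item
        bucket = itemset[depth] % FANOUT
        insert(node.children.setdefault(bucket, Node()), itemset, k, depth + 1)
        return
    node.leaf.append(itemset)
    if len(node.leaf) > MAX_LEAF and depth < k:       # convert the leaf to an internal node
        pending, node.leaf, node.children = node.leaf, [], {}
        for cand in pending:
            insert(node, cand, k, depth)              # redistribute one level deeper

def subset(node, transaction, counts, start=0, visited=None):
    """Increment counts[c] for every candidate c contained in the transaction.
    Each leaf is checked at most once per transaction; revisits are skipped."""
    if visited is None:
        visited = set()
    if node.children is None:                         # leaf: check its candidates
        if id(node) in visited:
            return
        visited.add(id(node))
        items = set(transaction)
        for cand in node.leaf:
            if set(cand) <= items:
                counts[cand] += 1
        return
    for i in range(start, len(transaction)):          # hash every remaining item
        bucket = transaction[i] % FANOUT
        if bucket in node.children:
            subset(node.children[bucket], transaction, counts, i + 1, visited)

root, counts = Node(), defaultdict(int)
for cand in [(1, 2, 4), (1, 2, 5), (1, 3, 6), (3, 5, 6), (3, 6, 7), (5, 6, 7)]:
    insert(root, cand, k=3)
subset(root, (1, 2, 3, 5, 6), counts)   # transaction items must be sorted
print(dict(counts))                     # {(1, 2, 5): 1, (1, 3, 6): 1, (3, 5, 6): 1}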
III. Parallel Algorithms
[Figure: the Count Distribution (CD) algorithm. Each of the P processors stores N/P transactions and an identical copy of the candidate hash tree; the local counts are combined with a global reduction. N: number of data items, M: size of candidate set, P: number of processors.]
[Figure: the Data Distribution (DD) algorithm. Each processor stores N/P transactions and builds a hash tree for a disjoint subset of M/P candidates; every processor broadcasts its local transactions to all other processors (all-to-all broadcast).]
while (!done) {
    FillBuffer(fd, SBuf);
    for (k = 0; k < P-1; ++k) {
        /* send/receive data in a non-blocking pipeline */
        MPI_Irecv(RBuf, left);
        MPI_Isend(SBuf, right);
        /* process transactions in SBuf and update hash tree */
        Subset(HTree, SBuf);
        MPI_Waitall();
        /* swap the two buffers */
        tmp = SBuf;
        SBuf = RBuf;
        RBuf = tmp;
    }
    /* process transactions in SBuf and update hash tree */
    Subset(HTree, SBuf);
}
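For reference, the same send-ahead/compute/swap pipeline can be written with mpi4py as in the following sketch; this is our sketch rather than the authors' code (which is C with MPI), and count_subsets is a hypothetical stand-in for the Subset() routine of the listing above.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size     # neighbors on the processor ring

def count_subsets(hash_tree, block):
    """Hypothetical stand-in for the Subset() routine of the figure."""
    pass

def pipeline_pass(hash_tree, local_block):
    """Shift transaction blocks around the ring, overlapping the subset
    computation on the current block with the transfer of the next one."""
    sbuf = local_block
    for _ in range(size - 1):
        recv_req = comm.irecv(source=left, tag=0)       # post receive of the next block
        send_req = comm.isend(sbuf, dest=right, tag=0)  # forward the current block
        count_subsets(hash_tree, sbuf)                  # compute while data is in flight
        rbuf = recv_req.wait()                          # for large blocks, pass an explicit
        send_req.wait()                                 # buffer to irecv
        sbuf = rbuf                                     # "swap" buffers
    count_subsets(hash_tree, sbuf)                      # process the final block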
[Figure: the Intelligent Data Distribution (IDD) algorithm. Each processor stores N/P local transactions and builds a hash tree for M/P candidates, partitioned by their first items and recorded in a bitmap (e.g. items 1, 7 on one processor, 2, 5 on another, 3, 6 on a third); transaction blocks are shifted from processor to processor so that every processor eventually sees all transactions. N: number of data items, M: size of candidate set, P: number of processors.]
[Figure: subset operation in IDD on the transaction {1 2 3 5 6} with the bitmap {1, 3, 5}. Only starting items present in the processor's bitmap are expanded at the root (1 + {2 3 5 6} and 3 + {5 6}); item 2 is not in the bitmap, so the expansion 2 + {3 5 6} is skipped.]
[Figure: the Hybrid Distribution (HD) algorithm on a grid of processors. Step 1: partitioning of candidate sets and data movement along the columns. Each row of the grid builds the candidate hash tree for one partition of the candidates (for example {1,2}, {4,5}, {7,8} in one row, {2,3}, {5,6}, {8,9} in another, and {3,4}, {6,7}, {6,8} in a third), and transaction blocks are shifted along the columns ("Data Shift"); the figure also shows an all-to-all broadcast step after which each processor holds all of the candidate partitions.]
TABLE II
Processor configuration and number of candidates of the HD algorithm with 64 processors and with m = 50K at each pass. Note that the 64 × 1 configuration is the same as the IDD algorithm and 1 × 64 is the same as the CD algorithm. The total number of passes was 13, and all passes after 6 had the 1 × 64 configuration.

Pass            1       2       3       4       5       6
Configuration   8 × 8   64 × 1  4 × 16  2 × 32  2 × 32  1 × 64
No. of Cand.    351K    4348K   115K    76K     56K     34K
The efficiency of a parallel algorithm running on P processors is defined as

E = \frac{T_{serial}}{P \, T_P}.

A parallel algorithm is scalable if P T_P and T_serial remain of the same order [9]. The problem size (i.e., the serial runtime) for the Apriori algorithm increases either by increasing N or by increasing M (as a result of lowering the minimum support) in the algorithms discussed in Section III. Table III describes the symbols used in this section.
As discussed in Section II, each iteration of the algorithm consists of two steps: (i) candidate generation and hash tree construction, and (ii) computation of the subset function for each transaction. The derivation of the runtime of the subset function is much more involved. Consider a transaction that has I items. During the k-th pass of the algorithm, this transaction has C = \binom{I}{k} potential candidates that need to be checked against the candidate hash tree. Note that for a given transaction, if checking for one potential candidate leads to a visit to a leaf node, then all the candidates of this transaction are checked against the leaf node. As a result, if this node is revisited due to a different candidate from the same transaction, no checking needs to be performed. Clearly the total cost of checking at the leaf nodes is directly proportional to the number of distinct leaf nodes visited with the transaction. We assume that the average number of candidate item-sets at the leaf nodes is S. Hence the average number of leaf nodes in a hash tree is L = M/S. In the implementation of the algorithm, the desired value of S can be obtained by adjusting the branching factor of the hash tree. In general, the cost of traversal for each potential candidate will depend on the depth of the leaf node in the hash tree reached by the traversal. To simplify the analysis, we assume that the cost of each traversal is the same. Hence, the total traversal cost is directly proportional to C. For each potential candidate, we define t_{travers} to be the cost associated with the traversal of the hash tree and t_{check} to be the cost associated with checking the candidate item-sets of the reached leaf node.
Note that the number of distinct leaves checked by a transaction is in general smaller than the number of potential candidates C. This is because different potential candidates may lead to the same leaf node. In general, if C is relatively large with respect to the number of leaf nodes, many of the potential candidates lead to leaves that have already been visited, and the number of distinct leaves visited is considerably smaller than C.
TABLE III

Symbol       Definition
N            Total number of transactions
P            Number of processors
M            Total number of candidates
G            Number of partitions of candidates in the HD algorithm
k            Pass number in the Apriori algorithm
I            Average number of items in a transaction
C            Average number of potential candidates in a transaction
S            Average number of candidates at a leaf node
L            Average number of leaves in the hash tree for the serial Apriori algorithm
t_{travers}  Cost of hash tree traversal per potential candidate
t_{check}    Cost of checking at a leaf with S candidates
V_{i,j}      Expected number of leaves visited with i potential candidates and j leaves
Let P_v denote the probability that the next potential candidate of a transaction reaches a leaf that has already been visited by that transaction, so that P_v = V_{i-1,j}/j, and let P_n = 1 - P_v be the probability that it reaches a new leaf. Then

V_{1,j} = 1,
V_{i,j} = V_{i-1,j} P_v + (V_{i-1,j} + 1) P_n
        = V_{i-1,j} \frac{V_{i-1,j}}{j} + (V_{i-1,j} + 1) \frac{j - V_{i-1,j}}{j}
        = 1 + \frac{j-1}{j} V_{i-1,j}
        = \frac{1 - \left(\frac{j-1}{j}\right)^i}{1 - \frac{j-1}{j}}
        = \frac{j^i - (j-1)^i}{j^{i-1}}.                                   (1)

Note that for large j, V_{i,j} \simeq i. This can be shown by taking the limit of Equation 1 and applying L'Hopital's rule i-1 times:

\lim_{j \to \infty} V_{i,j} = \lim_{j \to \infty} \frac{j^i - (j-1)^i}{j^{i-1}}
    = \lim_{j \to \infty} \frac{[i(i-1) \cdots 3 \cdot 2]\, j - [i(i-1) \cdots 3 \cdot 2]\, (j-1)}{(i-1)(i-2) \cdots 2 \cdot 1}
    = i j - i (j-1)
    = i.                                                                   (2)
This shows that if the hash tree size is much larger than the
number of potential candidates in a transaction, then each
potential candidate is likely to visit a distinct leaf node in
the hash tree.
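As a quick numerical sanity check of Equation 1 (ours, not from the paper), the recurrence and the closed form can be compared directly; for j much larger than i both values approach i, as Equation 2 predicts.

def v_recurrence(i, j):
    """V_{i,j} from the recurrence V_{1,j} = 1, V_{i,j} = 1 + ((j-1)/j) V_{i-1,j}."""
    v = 1.0
    for _ in range(i - 1):
        v = 1.0 + (j - 1) / j * v
    return v

def v_closed(i, j):
    """Closed form of Equation 1: (j^i - (j-1)^i) / j^(i-1)."""
    return (j**i - (j - 1)**i) / j**(i - 1)

for i, j in [(10, 20), (10, 100), (10, 10_000)]:
    print(i, j, round(v_recurrence(i, j), 4), round(v_closed(i, j), 4))
# The two columns agree, and the value tends to i = 10 as j grows.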
For the CD algorithm, the computation per processor is:

T_{comp}^{CD} = \underbrace{\frac{N}{P} T_{trans}}_{\text{subset function}}
              + \underbrace{O(M)}_{\text{hash tree construction}}
              + \underbrace{O(M)}_{\text{global reduction}}
            = \frac{N}{P} C \, t_{travers} + \frac{N}{P} V_{C,L} \, t_{check} + O(M).     (4)
Comparing Equation 4 to Equation 3, we see that CD performs no redundant computation. In particular, both the time for traversal and the time for checking scale down by a factor of P. However, the cost of hash tree construction is the same as in the serial algorithm, and CD has the additional cost of the global reduction.
In the DD algorithm, each processor builds a hash tree for only M/P candidates (with L/P leaves on the average), but has to process all N transactions. The computation per processor is therefore:

T_{comp}^{DD} = N C \, t_{travers} + N V_{C, L/P} \, t_{check}
              + \underbrace{O(M/P)}_{\text{hash tree construction}} + \underbrace{O(N)}_{\text{data movement}}.     (5)

Comparing Equation 5 with the serial complexity (Equation 3), we see that the DD algorithm does not reduce the computation associated with the hash tree traversal. For both the serial Apriori and the DD algorithm, this cost is N C t_{travers}. However, the DD algorithm is able to reduce the cost associated with the checking at the leaf nodes. In particular, it reduces the serial cost of N V_{C,L} t_{check} down to N V_{C, L/P} t_{check}. However, because V_{C, L/P} > V_{C,L} / P, the reduction achieved in this part is less than a factor of P. We can easily see this if we consider the case when L is very large. In this case, V_{C, L/P} \simeq C and V_{C,L} / P \simeq C/P by Equation 2. Thus, the number of leaf nodes checked over all the processors by the DD algorithm is higher than that of the serial algorithm. This is why the DD algorithm performs redundant computation. Furthermore, DD has the extra cost of data movement. Due to these two factors, DD does not scale with respect to increasing N. However, the cost of building the hash tree scales down by a factor of P. Thus, DD is scalable with respect to increasing M.
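The redundant leaf checking of DD can also be seen numerically from the closed form of Equation 1; the values of C, L, and P below are arbitrary choices made only for this illustration.

def v(i, j):
    """Expected number of distinct leaves visited (Equation 1)."""
    return (j**i - (j - 1)**i) / j**(i - 1)

C, L, P = 50, 1000, 16                  # assumed: candidates per transaction, leaves, processors
serial_checks = v(C, L)                 # leaf checks per transaction, serial Apriori
cd_checks = v(C, L)                     # CD: same tree on every processor, but each
                                        # transaction is processed by exactly one processor
dd_checks = P * v(C, L // P)            # DD: every processor (tree of L/P leaves) sees
                                        # every transaction, summed over the P processors
print(round(serial_checks, 1), round(cd_checks, 1), round(dd_checks, 1))
# DD's total per transaction exceeds the serial value, while CD's matches it.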
The IDD algorithm. In the IDD algorithm, just like in the DD algorithm, the average number of leaf nodes in the local hash tree of each processor is L/P. However, the average number of potential candidates that need to be checked for each transaction at each processor is much smaller than in DD, because of the intelligent partitioning of the candidate set and the use of a bitmap to prune at the root of the hash tree. More precisely, the number of potential candidates that need to be checked for a transaction is roughly C/P, assuming that we have a good balanced partition. So the computation per transaction is:

T_{trans}^{IDD} = \frac{C}{P} t_{travers} + V_{C/P, \, L/P} \, t_{check}.     (6)

The HD algorithm. In the HD algorithm the candidate set is partitioned into G parts, so the computation per transaction is:

T_{trans}^{HD} = \frac{C}{G} t_{travers} + V_{C/G, \, L/G} \, t_{check}.

The total number of transactions each processor has to process is G N / P. Thus the computation per processor is:
T_{comp}^{HD} = \frac{G N}{P} T_{trans}^{HD}
              + \underbrace{O\!\left(\frac{M}{G}\right)}_{\text{hash tree construction}}
              + O\!\left(\frac{G N}{P}\right)
            = \frac{G N}{P} \left( \frac{C}{G} t_{travers} + V_{C/G, \, L/G} \, t_{check} \right)
              + O\!\left(\frac{M}{G}\right) + O\!\left(\frac{G N}{P}\right).     (7)

O\!\left(\frac{G N}{P}\right) + O\!\left(\frac{M}{G}\right) < O\!\left(\frac{N}{P}\right) + O(M).     (8)
We implemented our parallel algorithms on a 128-processor Cray T3E and on an IBM SP2 parallel computer. Each processor on the T3E is a 600 MHz DEC Alpha (EV5) and has 512 Mbytes of memory. The processors are interconnected via a three-dimensional torus network that has a peak unidirectional bandwidth of 430 Mbytes/second and a small latency. For communication we used the Message Passing Interface (MPI). Our experiments have shown that for 16 Kbyte messages we obtain a bandwidth of 303 Mbytes/second and an effective startup time of 16 microseconds. SP2 nodes consist of a Power2 processor clocked at 66.7 MHz with 128 Kbytes of data cache and 32 Kbytes of instruction cache.
[Fig. 10. Scaleup result on Cray T3E with 50K transactions and 0.1% minimum support; the figure plots the curves for CD, IDD, HD, DD, and DD + communication against the number of processors.]
[Fig. 11. Comparison of DD and IDD in terms of the average number of distinct leaf nodes visited per transaction, with 50K transactions per processor and 0.2% minimum support, as the number of processors grows.]
[Fig. 12. Response time on a 16-processor IBM SP2 with 100K transactions as the minimum support varies from 0.1% to 0.025% (plotted against the number of candidates, in millions).]
[Fig. 13. Speedup of the three algorithms (CD, IDD, HD) on Cray T3E as P is increased from 4 to 64, with N = 1.3 million and M = 0.7 million. The processor configurations for HD were 8 × 2 for 16 processors, 8 × 4 for 32 processors, and 8 × 8 for 64 processors.]
[Fig. 14. Runtime of the three algorithms on Cray T3E as N is increased from 1.3 million to 26.1 million, with M = 0.7 million and P = 64. The processor configuration for HD was 8 × 8.]
[Figure: CD, IDD, and HD on Cray T3E as the number of candidates is increased (in millions).]