
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. XX, NO. Y, MONTH 1999

Scalable Parallel Data Mining for Association Rules


Eui-Hong (Sam) Han, George Karypis, and Vipin Kumar

Abstract: In this paper we propose two new parallel formulations of the Apriori algorithm that is used for computing association rules. These new formulations, IDD and HD, address the shortcomings of two previously proposed parallel formulations, CD and DD. Unlike the CD algorithm, the IDD algorithm partitions the candidate set intelligently among processors to efficiently parallelize the step of building the hash tree. The IDD algorithm also eliminates the redundant work inherent in DD, and requires substantially smaller communication overhead than DD. But IDD suffers from the added cost due to communication of transactions among processors. HD is a hybrid algorithm that combines the advantages of CD and DD. Experimental results on a 128-processor Cray T3E show that HD scales just as well as the CD algorithm with respect to the number of transactions, and scales as well as IDD with respect to increasing candidate set size.

Keywords: Data mining, parallel processing, association rules, load balance, scalability.

I. Introduction

One of the important problems in data mining [1] is discovering association rules from databases of transactions, where each transaction contains a set of items. The most time consuming operation in this discovery process is the computation of the frequencies of the occurrence of subsets of items, also called candidates, in the database of transactions. Since such transaction-based databases usually contain a large number of distinct items, the total number of candidates is prohibitively large. Hence, current association rule discovery techniques [2], [3], [4], [5] try to prune the search space by requiring a minimum level of support for candidates under consideration. Support is a measure of the number of occurrences of the candidates in the database transactions. Apriori [2] is a recent state-of-the-art algorithm that aggressively prunes the set of potential candidates of size k by using the following observation: a candidate of size k can meet the minimum level of support only if all of its subsets also meet the minimum level of support. In the kth iteration, this algorithm computes the occurrences of potential candidates of size k in each of the transactions. To do this task efficiently, the algorithm maintains all potential candidates of size k in a hash tree. This algorithm does not require the transactions to stay in main memory, but it does require the hash tree to stay in main memory. If the entire hash tree cannot fit in the main memory, then the hash tree needs to be partitioned, and multiple passes over the transaction database need to be performed (one for each partition of the hash tree). Even with the highly effective pruning method of Apriori, the task of finding all association rules in many applications can require a lot of computation power that is available only in parallel computers.

Two parallel formulations of the Apriori algorithm were proposed in [6], Count Distribution (CD) and Data Distribution (DD). The CD algorithm scales linearly and has excellent speedup and sizeup behavior with respect to the number of transactions [6]. However, there are two problems with this algorithm. First, it does not parallelize the computation for building the hash tree. In the serial algorithm this step takes a relatively small amount of time, but in a parallel setting it can become a major bottleneck. Second, if the hash tree does not fit in the main memory, then the extra disk I/O for the multiple passes over the transaction database can be expensive on machines with slow I/O systems. Hence, the CD algorithm, like its sequential counterpart Apriori, is unscalable with respect to the increasing size of the candidate set. The DD algorithm addresses these problems of the CD algorithm by partitioning the candidate set and assigning a partition to each processor in the system. However, this algorithm suffers from three types of inefficiency. First, the algorithm results in high communication overhead due to an inefficient scheme used for data movement. Second, the schedule for interactions among processors is such that it can cause processors to idle. Third, each transaction has to be processed against multiple hash trees, causing redundant computation.

In this paper, we present two new parallel formulations of the Apriori algorithm for mining association rules. We first present the Intelligent Data Distribution (IDD) algorithm that improves upon the DD algorithm by minimizing communication overhead and processor idling time, and by eliminating redundant computation. However, the static partitioning of the hash tree results in a load imbalance that becomes severe for large numbers of processors. Furthermore, even with the optimized communication scheme, the communication overhead of IDD grows linearly with the number of transactions. Our second formulation, the Hybrid Distribution (HD) algorithm, combines the advantages of both the CD algorithm and the IDD algorithm by dynamically grouping processors and partitioning the candidate set accordingly to maintain good load balance. The experimental results on a Cray T3E parallel computer show that the HD algorithm scales very well and exploits the aggregate memory efficiently.

The rest of this paper is organized as follows. Section II provides an overview of the serial algorithm for mining association rules. Section III describes existing and proposed parallel algorithms. Section IV presents the performance analysis of the algorithms. Experimental results are shown in Section V. Section VI contains conclusions. A preliminary version of this paper appeared in [7].

II. Basic Concepts
Let T be the set of transactions, where each transaction is a subset of the item-set I. Let C be a subset of I; then we define the support count of C with respect to T to be

$\sigma(C) = |\{\, t \mid t \in T,\ C \subseteq t \,\}|.$

Thus $\sigma(C)$ is the number of transactions that contain C. For example, consider the set of supermarket transactions shown in Table I. The item-set I for these transactions is {Bread, Beer, Coke, Diaper, Milk}. The support count of {Diaper, Milk} is $\sigma(\{\text{Diaper, Milk}\}) = 3$, whereas $\sigma(\{\text{Diaper, Milk, Beer}\}) = 2$.

TABLE I
Transactions from a supermarket.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk
An association rule is an expression of the form $X \Rightarrow Y$, where $X \subseteq I$ and $Y \subseteq I$. The support s of the rule $X \Rightarrow Y$ is defined as $\sigma(X \cup Y)/|T|$, and the confidence is defined as $\sigma(X \cup Y)/\sigma(X)$. For example, consider the rule {Diaper, Milk} => {Beer}, i.e., the presence of diaper and milk in a transaction tends to indicate the presence of beer in the transaction. The support of this rule is $\sigma(\{\text{Diaper, Milk, Beer}\})/5 = 40\%$. The confidence of this rule is $\sigma(\{\text{Diaper, Milk, Beer}\})/\sigma(\{\text{Diaper, Milk}\}) = 66\%$. A rule that has a very high confidence (i.e., close to 1.0) is often very important, because it provides an accurate prediction on the association of the items in the rule. The support of a rule is also important, since it indicates how frequent the rule is in the transactions. Rules that have very small support are often uninteresting, since they do not describe significantly large populations. This is one of the reasons why most algorithms [2], [3], [4] disregard any rules that do not satisfy the minimum support condition specified by the user. This filtering due to the minimum required support is also critical in reducing the number of derived association rules to a manageable size. Note that the total number of possible rules is proportional to the number of subsets of the item-set I, which is $2^{|I|}$. Hence the filtering is absolutely necessary in most practical settings.
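To make the definitions concrete, the short C program below recomputes the support and confidence of the rule {Diaper, Milk} => {Beer} over the five transactions of Table I. The bit-mask encoding of items and transactions is introduced only for this illustration; it is not part of the algorithms discussed in the paper.

#include <stdio.h>

/* Item codes used only for this illustration. */
enum { BREAD = 1, BEER = 2, COKE = 4, DIAPER = 8, MILK = 16 };

/* Support count: number of transactions t with C a subset of t. */
static int support_count(const int *db, int n, int itemset)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if ((db[i] & itemset) == itemset)
            count++;
    return count;
}

int main(void)
{
    /* The five transactions of Table I, encoded as bit masks. */
    int db[] = { BREAD | COKE | MILK,
                 BEER  | BREAD,
                 BEER  | COKE  | DIAPER | MILK,
                 BEER  | BREAD | DIAPER | MILK,
                 COKE  | DIAPER | MILK };
    int n  = sizeof(db) / sizeof(db[0]);
    int x  = DIAPER | MILK;        /* X = {Diaper, Milk}              */
    int xy = x | BEER;             /* X union Y = {Diaper, Milk, Beer} */

    double support    = (double)support_count(db, n, xy) / n;
    double confidence = (double)support_count(db, n, xy) /
                        support_count(db, n, x);
    printf("support = %.1f%%, confidence = %.1f%%\n",
           100.0 * support, 100.0 * confidence);   /* 40.0%, 66.7% */
    return 0;
}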
The task of discovering association rules is to find all rules $X \Rightarrow Y$ whose support s is greater than or equal to a given minimum support threshold and whose confidence is greater than or equal to a given minimum confidence threshold. Association rule discovery is composed of two steps. The first step is to discover all the frequent item-sets (candidate sets that have more support than the minimum support threshold specified). The second step is to generate association rules from these frequent item-sets. The computation of finding the frequent item-sets is much more expensive than finding the rules from these frequent item-sets. Hence, in this paper, we only focus on the first step. The parallel implementation of the second step is straightforward and is discussed in [6].

1. F_1 = { frequent 1-item-sets };
2. for ( k = 2; F_{k-1} != (empty set); k++ ) {
3.     C_k = apriori_gen(F_{k-1})
4.     for all transactions t in T {
5.         subset(C_k, t)
6.     }
7.     F_k = { c in C_k | c.count >= minsup }
8. }
9. Answer = Union over k of F_k

Fig. 1. Apriori Algorithm
A number of sequential algorithms have been developed for discovering frequent item-sets [8], [2], [3]. Our parallel algorithms are based on the Apriori algorithm [2], which has a smaller computational complexity than the other algorithms. In the rest of this section, we briefly describe the Apriori algorithm; the reader should refer to [2] for further details.

The high level structure of the Apriori algorithm is given in Figure 1. The Apriori algorithm consists of a number of passes. Initially F_1 contains all the items (i.e., item-sets of size one) that satisfy the minimum support requirement. During pass k, the algorithm finds the set of frequent item-sets F_k of size k that satisfy the minimum support requirement. The algorithm terminates when F_k is empty. In each pass, the algorithm first generates C_k, the candidate item-sets of size k. The function apriori_gen(F_{k-1}) constructs C_k by extending frequent item-sets of size k-1. This ensures that all the subsets of size k-1 of a new candidate item-set are in F_{k-1}. Once the candidate item-sets are found, their frequencies are computed by counting how many transactions contain these candidate item-sets. Finally, F_k is generated by pruning C_k to eliminate item-sets with frequencies smaller than the minimum support. The union of the frequent item-sets over all passes is the set of frequent item-sets from which we generate association rules.
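To illustrate the join step performed by apriori_gen, the following C sketch builds candidate k-item-sets by merging pairs of frequent (k-1)-item-sets that share their first k-2 items. The data layout (item-sets stored as sorted integer arrays, with F sorted lexicographically) is an assumption made for this illustration, and the subset-based pruning of Apriori is only indicated by a comment; this is not the authors' implementation.

#include <stdlib.h>
#include <string.h>

/* F[i] points to the i-th frequent (k-1)-item-set, stored as a sorted
 * array of k-1 item identifiers; F itself is sorted lexicographically,
 * as produced by the previous pass. */
int *apriori_gen(int **F, int num_freq, int k, int *num_cand)
{
    int cap = 16, n = 0;
    int *cand = malloc(cap * k * sizeof(int));

    for (int a = 0; a < num_freq; a++) {
        for (int b = a + 1; b < num_freq; b++) {
            /* join only pairs that agree on their first k-2 items */
            if (memcmp(F[a], F[b], (k - 2) * sizeof(int)) != 0)
                break;                      /* F is sorted, so stop here */
            if (n == cap) {
                cap *= 2;
                cand = realloc(cand, cap * k * sizeof(int));
            }
            /* new candidate = common prefix + the two distinct last items */
            memcpy(&cand[n * k], F[a], (k - 1) * sizeof(int));
            cand[n * k + k - 1] = F[b][k - 2];
            n++;
            /* a full implementation would now prune the candidate if any
             * of its (k-1)-subsets is not in F (the Apriori property)    */
        }
    }
    *num_cand = n;
    return cand;
}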
Computing the counts of the candidate item-sets is the most computationally expensive step of the algorithm. One naive way to compute these counts is to perform string-matching of each transaction against each candidate item-set. A faster way of performing this operation is to use a candidate hash tree in which the candidate item-sets are hashed [2]. Here we explain this via an example to facilitate the discussion of the parallel algorithms and their analysis.

Figure 2 shows one example of a candidate hash tree with candidates of size 3. The internal nodes of the hash tree have hash tables that contain links to child nodes. The leaf nodes contain the candidate item-sets. A hash tree of candidate item-sets is constructed as follows. Initially, the hash tree contains only a root node, which is a leaf node containing no candidate item-set. When each candidate item-set is generated, the items in the set are stored in sorted order. Note that since C_1 and F_1 are created in
sorted order, each candidate item-set is generated in sorted order without any need for explicit sorting. Each candidate item-set is inserted into the hash tree by hashing each successive item at the internal nodes and then following the links in the hash table. Once a leaf is reached, the candidate item-set is inserted at the leaf if the total number of candidate item-sets there is less than the maximum allowed. If the total number of candidate item-sets at the leaf exceeds the maximum allowed and the depth of the leaf is less than k, the leaf node is converted into an internal node and child nodes are created for the new internal node. The candidate item-sets are distributed to the child nodes according to the hash values of their items. For example, the candidate item-set {1 2 4} is inserted by hashing item 1 at the root to reach the left child node of the root, hashing item 2 at that node to reach the middle child node, and hashing item 4 to reach the left child node, which is a leaf node.

Fig. 2. Subset operation on the root of a candidate hash tree (transaction {1 2 3 5 6} against a hash tree of 3-item-set candidates; the hash function maps items 1, 4, 7 to the left child, 2, 5, 8 to the middle child, and 3, 6, 9 to the right child).

Fig. 3. Subset operation on the left-most subtree of the root of a candidate hash tree.
The subset function traverses the hash tree from the root with every item in a transaction as a possible starting item of a candidate. In the next level of the tree, all the items of the transaction following the starting item are hashed. This is done recursively until a leaf is reached. At this time, all the candidates at the leaf are checked against the transaction and their counts are updated accordingly. Figure 2 shows the subset operation at the first level of the tree with transaction {1 2 3 5 6}. Item 1 is hashed to the left child node of the root and the following transaction {2 3 5 6} is applied recursively to the left child node. Item 2 is hashed to the middle child node of the root and the whole transaction is checked against the two candidate item-sets in the middle child node. Then item 3 is hashed to the right child node of the root and the following transaction {5 6} is applied recursively to the right child node. Figure 3 shows the subset operation on the left child node of the root. Here items 2 and 5 are hashed to the middle child node and the following transactions {3 5 6} and {6}, respectively, are applied recursively to the middle child node. Item 3 is hashed to the right child node and the following transaction {5 6} is applied recursively to the right child node.
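The following C sketch shows one possible realization of the hash tree and of the recursive subset operation just described. The node layout, the fixed hash function (item mod B), and the per-leaf visit stamp used to avoid re-checking a leaf for the same transaction are our own illustrative choices under stated assumptions, not the authors' implementation.

#define B 3          /* branching factor: item i hashes to child i % B     */
#define MAXLEAF 8    /* maximum number of candidates stored in one leaf    */

typedef struct node {
    int is_leaf;
    struct node *child[B];      /* internal node: hash table of child links */
    int ncand;                  /* leaf node: number of stored candidates   */
    const int *cand[MAXLEAF];   /* each candidate: k items in sorted order  */
    int count[MAXLEAF];         /* support counters, updated by subset()    */
    int last_tid;               /* last transaction id that checked this
                                   leaf (nodes start at 0, ids start at 1)  */
} node_t;

/* Is the sorted candidate c (k items) contained in the sorted transaction t? */
int contains(const int *t, int n, const int *c, int k)
{
    int i = 0, j = 0;
    while (i < n && j < k) {
        if (t[i] < c[j])        i++;
        else if (t[i] == c[j]) { i++; j++; }
        else                    return 0;
    }
    return j == k;
}

/* Recursive subset operation: 'rest' holds the items of the transaction that
 * may still be hashed on the way down; 't' is the whole transaction.        */
void subset(node_t *nd, const int *t, int n,
            const int *rest, int nrest, int k, int tid)
{
    if (nd->is_leaf) {
        if (nd->last_tid == tid)          /* leaf already checked for this   */
            return;                       /* transaction: nothing to do      */
        nd->last_tid = tid;
        for (int c = 0; c < nd->ncand; c++)
            if (contains(t, n, nd->cand[c], k))
                nd->count[c]++;
        return;
    }
    /* hash every remaining item and recurse with the items that follow it  */
    for (int i = 0; i < nrest; i++) {
        node_t *ch = nd->child[rest[i] % B];
        if (ch != NULL)
            subset(ch, t, n, rest + i + 1, nrest - i - 1, k, tid);
    }
}

/* Counting pass: call subset(root, t, n, t, n, k, tid) for each transaction. */

The visit stamp implements the observation made later in Section IV that a leaf revisited for the same transaction requires no further checking.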
III. Parallel Algorithms

In this section, we will focus on the parallelization of the task that finds all frequent item-sets. We first discuss the two parallel algorithms proposed in [6] to help motivate our parallel formulations. We also briefly discuss other parallel algorithms. In all our discussions, we assume that the transactions are evenly distributed among the processors.

A. Count Distribution Algorithm

In the Count Distribution (CD) algorithm proposed in [6], each processor computes how many times all the candidates appear in the locally stored transactions. This is done by building the entire hash tree that corresponds to all the candidates and then performing a single pass over the locally stored transactions to collect the counts. The global counts of the candidates are computed by summing these individual counts using a global reduction operation [9]. This algorithm is illustrated in Figure 4. Note that since each processor needs to build a hash tree for all the candidates, these hash trees are identical at each processor. Thus, excluding the global reduction, each processor in the CD algorithm executes the serial Apriori algorithm on the locally stored transactions.
This algorithm has been shown to scale linearly with the number of transactions [6]. This is because each processor can compute the counts independently of the other processors and needs to communicate with the other processors only once, at the end of the computation step. However, this algorithm does not parallelize the computation of building the candidate hash tree. This step becomes a bottleneck with a large number of processors. Furthermore, if the number of candidates is large, then the hash tree does not fit into the main memory. In this case, the algorithm has to partition the hash tree and compute the counts by scanning the database multiple times, once for each partition of the hash tree. The cost of this extra database scanning can be expensive on machines with slow I/O systems. Note that the number of candidates increases if either the number of distinct items in the database increases or the minimum support level of the association rules decreases. Thus the CD algorithm is effective for a small number of distinct items and a high minimum support level.
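In practice, the global reduction of CD maps onto a single MPI collective. The sketch below assumes that the per-candidate counts have been gathered into one contiguous array in the same candidate order on every processor; it is an illustration of the step, not the authors' code.

#include <mpi.h>

/* local_count[i] is this processor's count for candidate i of C_k.  After
 * the call, global_count[i] holds the count over the whole database on every
 * processor, so each one can prune C_k down to F_k independently, exactly as
 * in the serial pruning step. */
void cd_global_reduction(int *local_count, int *global_count, int num_candidates)
{
    MPI_Allreduce(local_count, global_count, num_candidates,
                  MPI_INT, MPI_SUM, MPI_COMM_WORLD);
}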
B. Data Distribution Algorithm

The Data Distribution (DD) algorithm [6] addresses the memory problem of the CD algorithm by partitioning the candidate item-sets among the processors. This partitioning is done in a round-robin fashion. Each processor is responsible for computing the counts of its locally stored subset of the candidate item-sets for all the transactions in the database. In order to do that, each processor needs to scan the portions of the transactions assigned to the other processors as well as its locally stored portion of the transactions. In the DD algorithm, this is done by having each processor receive the portions of the transactions stored on the other processors as follows. Each processor allocates P buffers (each one page long, one for each processor). At processor P_i, the i-th buffer is used to store transactions from the locally stored database and the remaining buffers are used to store transactions from the other processors. Each processor P_i then checks the P buffers to see which one contains data. Let l be this buffer (ties are broken in favor of buffers of other processors, and ties among buffers of other processors are broken arbitrarily). The processor processes the transactions in this buffer and updates the counts of its own candidate subset. If this buffer is the one that stores local transactions (i.e., l = i), then it is sent to all the other processors (via asynchronous sends) and a new page is read from the local database. If this buffer stores transactions from another processor (i.e., l != i), then it is cleared and marked available for the next asynchronous receive from any other processor. This continues until every processor has processed all the transactions.

Having computed the counts of its candidate item-sets, each processor finds the frequent item-sets from its candidate item-sets, and these frequent item-sets are sent to every other processor using an all-to-all broadcast operation [9]. Figure 5 shows the high level operations of the algorithm. Note that each processor has a different set of candidates in its candidate hash tree.

This algorithm exploits the total available memory better than CD, as it partitions the candidate set among processors. As the number of processors increases, the number of candidates that the algorithm can handle also increases. However, as reported in [6], the performance of this algorithm is significantly worse than that of the CD algorithm. The run time of this algorithm is 10 to 20 times more than that of the CD algorithm on 16 processors [6]. The problem lies with the communication pattern of the algorithm and the redundant work that is performed in processing all the transactions.
The communication pattern of this algorithm causes three problems. First, during each pass of the algorithm each processor sends the portion of the database that resides locally to all the other processors. In particular, each processor reads the locally stored portion of the database one page at a time and sends it to all the other processors by issuing P - 1 send operations. Similarly, each processor issues a receive operation for each other processor in order to receive these pages. If the interconnection network of the underlying parallel computer is fully connected (i.e., there is a direct link between all pairs of processors) and each processor can receive data on all incoming links simultaneously, then this communication pattern leads to very good performance. In particular, if O(N/P) is the size of the database assigned locally to each processor, the amount of time spent in communication will be O(N/P). However, even on a parallel computer with a fully connected network, if each processor can receive data from (or send data to) only one other processor at a time, then the communication will be O(N). On all realistic parallel computers, the processors are connected via sparser networks (such as a 2D or 3D mesh, or a hypercube), and a processor can receive data from (or send data to) only one other processor at a time. On such machines, this communication pattern will take significantly more than O(N) time because of contention within the network.

Second, on architectures without asynchronous communication support and with a finite number of communication buffers on each processor, the proposed all-to-all communication scheme causes processors to idle. For instance, consider the case when one processor finishes its operation on local data and sends the buffer to all other processors. Now if the communication buffer of any receiving processor is full and the outgoing communication buffers are full, then the send operation is blocked.

Third, if we look at the size of the candidate sets as a function of the number of passes of the algorithm, we see that in the first few passes the size of the candidate sets increases and after that it decreases. In particular, during the last several passes of the algorithm, there are only a
small number of items in the candidate sets. However, each processor in the DD algorithm still sends the locally stored portions of the database to all the other processors. Thus, even though the computation decreases, the amount of communication remains the same.

Fig. 4. Count Distribution (CD) Algorithm. Each of the P processors holds N/P transactions and an identical candidate hash tree over all M candidates; the local counts are combined with a global reduction (N: number of data items, M: size of candidate set, P: number of processors).

Fig. 5. Data Distribution (DD) Algorithm. Each of the P processors holds N/P transactions and a disjoint subset of M/P candidates; local pages are broadcast to all other processors, and the frequent item-sets are exchanged with an all-to-all broadcast (N: number of data items, M: size of candidate set, P: number of processors).
The redundant work is introduced by the fact that every processor has to process every single transaction in the database. In CD (see Figure 4), only N/P transactions go through each hash tree of M candidates, whereas in DD (see Figure 5), all N transactions have to go through each hash tree of M/P candidates. Although the number of candidates stored at each processor has been reduced by a factor of P, the amount of computation performed for each transaction has not been proportionally reduced. If the amount of work required to check a transaction against the hash tree of M/P candidates were 1/P of that for the hash tree of M candidates, then there would be no extra work. As discussed in Section IV, in general, the amount of work per transaction goes down by a factor much smaller than P.
C. Intelligent Data Distribution Algorithm

We developed the Intelligent Data Distribution (IDD) algorithm to solve the problems of the DD algorithm discussed in Section III-B. In IDD, the locally stored portions of the database are sent to all the other processors by using a ring-based all-to-all broadcast described in [9]. This operation does not suffer from the contention problems of the DD algorithm, and it takes O(N) time on any parallel architecture that can be embedded in a ring. Figure 6 shows the pseudo code for this data movement operation. In our algorithm, the processors form a logical ring and each processor determines its right and left neighboring processors. Each processor has one send buffer (SBuf) and one receive buffer (RBuf). Initially, SBuf is filled with one block of local data. Then each processor initiates an asynchronous send operation to its right neighbor with SBuf and an asynchronous receive operation from its left neighbor with RBuf. While these asynchronous operations are proceeding, each processor processes the transactions in SBuf and collects the counts of the candidates assigned to it. After this operation, each processor waits until the asynchronous operations complete. Then the roles of SBuf and RBuf are switched, and the above operations continue for P - 1 iterations. Compared to DD, where all the processors send data to all other processors, we perform only point-to-point communication between neighbors, thus eliminating any communication contention. Furthermore, if the time to process a buffer does not vary much, then there is little time lost in idling.

In order to eliminate the redundant work due to the partitioning of the candidate item-sets, we must find a fast way to check whether a given transaction can potentially contain any of the candidates stored at each processor. This cannot be done by partitioning C_k in a round-robin fashion. However, if we partition C_k among processors in such a way that each processor gets item-sets that begin only with a subset of all possible items, then we can check the items of
a transaction against this subset to determine if the hash tree contains candidates starting with those items. We traverse the hash tree with only the items in the transaction that belong to this subset. Thus, we solve the redundant work problem of DD by the intelligent partitioning of C_k.

while (!done) {
    FillBuffer(fd, SBuf);
    for (k = 0; k < P-1; ++k) {
        /* send/receive data in a non-blocking pipeline */
        MPI_Irecv(RBuf, left);
        MPI_Isend(SBuf, right);
        /* process transactions in SBuf and update the hash tree */
        Subset(HTree, SBuf);
        MPI_Waitall();
        /* swap the two buffers */
        tmp = SBuf;
        SBuf = RBuf;
        RBuf = tmp;
    }
    /* process transactions in the last received buffer */
    Subset(HTree, SBuf);
}

Fig. 6. Pseudo Code for Data Movement

Figure 7 shows the high level picture of the algorithm. In this example, Processor 0 has all the candidates starting with items 1 and 7, Processor 1 has all the candidates starting with 2 and 5, and so on. Each processor keeps the first items of its candidates in a bit-map. In the Apriori algorithm, at the root level of the hash tree, every item in a transaction is hashed and checked against the hash tree. In our algorithm, however, at the root level each processor filters every item of the transaction by checking it against the bit-map to see if the processor contains candidates starting with that item. If the processor does not contain candidates starting with that item, the processing steps involving that item as the first item of a candidate can be skipped. This reduces the amount of transaction data that has to go through the hash tree and thus reduces the computation. For example, let {1 2 3 4 5 6 7 8} be a transaction that processor 0 is processing in the subset function discussed in Section II. At the top level of the hash tree, processor 0 will only proceed with items 1 and 7 (i.e., 1 + 2 3 4 5 6 7 8 and 7 + 8). When the page containing this transaction is shifted to processor 1, that processor will only process items 2 and 5 (i.e., 2 + 3 4 5 6 7 8 and 5 + 6 7 8). Figure 8 shows how this scheme works when a processor contains only those candidate item-sets that start with 1, 3 and 5. Thus, for each transaction in the database, our approach partitions the amount of work to be performed among processors, eliminating most of the redundant work of DD. Note that both the judicious partitioning of the hash tree (indirectly caused by the partitioning of the candidate item-sets) and the filtering step are required to eliminate this redundant work.
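The root-level filtering of IDD can be expressed in a few lines. The sketch below is an illustrative variant of the subset routine shown earlier in Section II (it reuses the hypothetical node_t structure, B macro, and subset function from that sketch) and assumes the bit-map is stored as a byte array indexed by item identifier.

/* starts_here[i] is 1 iff this processor stores candidates whose first
 * item is i; it is the bit-map built from the partition of C_k.        */
void idd_subset(node_t *root, const unsigned char *starts_here,
                const int *t, int n, int k, int tid)
{
    if (root->is_leaf) {                  /* degenerate tree: no filtering   */
        subset(root, t, n, t, n, k, tid);
        return;
    }
    for (int i = 0; i < n; i++) {
        if (!starts_here[t[i]])           /* no local candidate starts with  */
            continue;                     /* t[i]: skip this branch entirely */
        node_t *ch = root->child[t[i] % B];
        if (ch != NULL)                   /* below the root, descend exactly */
            subset(ch, t, n, t + i + 1, n - i - 1, k, tid);
    }
}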
The intelligent partitioning of the candidate set used in IDD must also provide a good load balance.

Fig. 7. Intelligent Data Distribution (IDD) Algorithm. Each processor holds N/P transactions, a bit-map of the first items it owns (e.g., 1 and 7 on processor 0, 2 and 5 on processor 1), and the M/P candidates starting with those items; transaction pages are shifted around the ring and the frequent item-sets are exchanged with an all-to-all broadcast (N: number of data items, M: size of candidate set, P: number of processors).

Fig. 8. Subset operation on the root of a candidate hash tree in IDD. With bit-map {1, 3, 5} and transaction {1 2 3 5 6}, the branch starting with item 2 is skipped.

One of the criteria of a good partitioning here is to have an equal number of candidates on all the processors. This gives hash trees of about the same size on all the processors and thus provides good load balance among them. Note that in the DD algorithm this was accomplished by distributing candidates in a round-robin fashion. A naive method for assigning candidates to processors can lead to a significant load imbalance. For instance, consider a database with 100 distinct items numbered from 1 to 100, in which the transactions contain the items numbered 1 to 50 more often. If we partition the candidates between two processors and assign all the candidates starting with items 1 to 50 to processor P0 and the candidates starting with items 51 to 100 to processor P1, then there will be more work for processor P0.
To achieve an equal distribution of the candidate item-sets, we use a partitioning algorithm that is based on bin-packing [10]. For each item, we first compute the number of candidate item-sets starting with that particular item. Note that at this time we do not actually store the candidate item-sets, but only the number of candidate item-sets starting with each item. We then use a bin-packing algorithm to partition the items into P buckets such that the sums of the numbers of candidate item-sets starting with the items in each bucket are roughly equal. Once the location of each candidate item-set is determined, each processor locally regenerates and stores the candidate item-sets assigned to it. Note that bin-packing is used once per pass of the algorithm, and the amount of time spent on bin-packing is minor compared to the overall runtime. Figure 7 shows the partitioned candidate hash tree and its corresponding bit-maps on each processor.
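One simple way to realize this partitioning is the greedy bin-packing heuristic sketched below: items are considered in decreasing order of the number of candidates starting with them, and each item is assigned to the currently lightest bucket. The paper does not specify which bin-packing variant is used, so this is only an illustrative choice.

#include <stdlib.h>

typedef struct { int item; int ncand; } item_load_t;

static int by_decreasing_load(const void *a, const void *b)
{
    return ((const item_load_t *)b)->ncand - ((const item_load_t *)a)->ncand;
}

/* load[item]  = number of candidate k-item-sets whose first item is 'item'.
 * owner[item] = processor that will store all candidates starting with it. */
void partition_candidates(const int *load, int num_items, int P, int *owner)
{
    item_load_t *v = malloc(num_items * sizeof(*v));
    long *bucket = calloc(P, sizeof(long));

    for (int i = 0; i < num_items; i++) {
        v[i].item = i;
        v[i].ncand = load[i];
    }
    qsort(v, num_items, sizeof(*v), by_decreasing_load);

    for (int i = 0; i < num_items; i++) {
        int best = 0;                           /* pick the lightest bucket */
        for (int p = 1; p < P; p++)
            if (bucket[p] < bucket[best])
                best = p;
        bucket[best] += v[i].ncand;
        owner[v[i].item] = best;
    }
    free(v);
    free(bucket);
}

Each processor then regenerates only the candidates whose first item it owns, and the resulting owner array also serves as the bit-map consulted at the root of the hash tree.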
Note that this scheme will not be able to achieve an equal distribution of candidates if there are too many candidate item-sets starting with the same item. For example, if there are more than M/P candidates starting with the same item, then the processor containing the candidates starting with this item will have more than M/P candidates even if no other candidates are assigned to it. This problem gets more serious with increasing P. One way of handling it is to partition the candidate item-sets based on more than the first item of each candidate. In this approach, whenever the number of candidates starting with one particular item is greater than the threshold, this set is further partitioned using the second item of the candidate item-sets.
Note that an equal assignment of candidates to the processors does not guarantee a perfect load balance among processors. This is because the cost of traversal and of checking at the leaf nodes is determined not only by the size and shape of the candidate hash tree, but also by the actual items in the transactions. However, in our experiments we have observed a reasonably good correlation between the size of the candidate sets and the amount of work done by each processor. For example, with 4 processors we obtained a load imbalance of 1.3% in terms of the number of candidate sets, and this translated into a 5.4% load imbalance in the actual computation time. With 8 processors, we had a 2.3% load imbalance in the number of candidate sets, and this resulted in a 9.4% load imbalance in the computation time. Since the effect of the transactions on the work load cannot be easily estimated in advance, our scheme only ensures that each processor has a roughly equal number of candidate item-sets in its local hash tree.
D. Hybrid Algorithm

The IDD algorithm exploits the total system memory by partitioning the candidate set among all processors. The average number of candidates assigned to each processor is M/P, where M is the total number of candidates. As more processors are used, the number of candidates assigned to each processor decreases. This has two implications. First, with fewer candidates per processor, it is much more difficult to balance the work. Second, a smaller number of candidates gives a smaller hash tree and less computation work per transaction. Eventually the amount of computation may become less than the communication involved. This would be more evident in the later passes of the algorithm, as the hash tree size decreases dramatically. This reduces the overall efficiency of the parallel algorithm, and it is an even more serious problem on a system that cannot perform asynchronous communication.
The Hybrid Distribution (HD) algorithm addresses the above problem by combining the CD and the IDD algorithms in the following way. Consider a P-processor system in which the processors are split into equal-size groups of G processors each (so there are P/G groups). In the HD algorithm, we execute the CD algorithm as if there were only P/G processors. That is, we partition the transactions of the database into P/G parts, each of size N/(P/G), and assign the task of computing the counts of the candidate set C_k for one of these subsets of the transactions to each group of processors. Within each group, these counts are computed using the IDD algorithm. That is, the transactions and the candidate set C_k are partitioned among the processors of each group, so that each processor gets roughly |C_k|/G candidate item-sets and N/P transactions. Now, each group of processors computes the counts using the IDD algorithm, and the overall counts are computed by performing a reduction operation among the P/G groups of processors.


The HD algorithm can be better visualized if we think of the processors as being arranged in a two-dimensional grid of G rows and P/G columns. The transactions are partitioned equally among the P processors. The candidate set C_k is partitioned among the processors of each column of this grid. This partitioning of C_k is identical for each column of processors; i.e., the processors along each row of the grid get the same subset of C_k. Figure 9 illustrates the HD algorithm for a 3 x 4 grid of processors. In this example, the HD algorithm executes the CD algorithm as if there were only 4 processors, where the 4 processors correspond to the 4 processor columns. That is, the database transactions are partitioned into 4 parts, and each one of these 4 hypothetical processors computes the local counts of all the candidate item-sets. The global counts can then be computed by performing the global reduction operation discussed in Section III-A. However, since each one of these hypothetical processors is made up of 3 processors, the computation of the local counts of the candidate item-sets in a hypothetical processor requires the computation of the counts of the candidate item-sets over the database transactions sitting on the 3 processors. This operation is performed by executing the IDD algorithm within each of the 4 hypothetical processors, as shown in Step 1 of Figure 9. Note that processors in the same row have exactly the same candidates, and the candidate sets along each column partition the total candidate set. At the end of this operation, each processor has the complete counts of its local candidates for all the transactions located in the processors of the same column (i.e., of one hypothetical processor). Now a reduction operation is performed along the rows such that all processors in each row have the sum of the counts for the candidates in that row. At this point, the count associated with each candidate item-set corresponds to the entire database of transactions. Now each processor finds the frequent item-sets by dropping all those candidate item-sets whose frequency is less than the threshold for minimum support. These candidate item-sets are shown shaded in Figure 9(b). In the next step, each processor performs an all-to-all broadcast operation along the columns of the processor mesh. At this point, all the processors have the frequent item-sets and are ready to proceed to the next pass.
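The row and column operations of HD map naturally onto MPI sub-communicators. The sketch below is our own illustration rather than the authors' code: it splits MPI_COMM_WORLD into a G x (P/G) logical grid, performs the Step 2 reduction along each row, and exchanges the locally frequent item-sets along each column. The arrays recv_counts and displs for the variable-size gather are assumed to have been computed beforehand (e.g., by first exchanging the per-processor counts nfreq).

#include <mpi.h>

/* counts: this processor's counts for its M/G local candidates.  After the
 * row reduction, every processor in a row holds the global counts of the
 * same candidate subset; the frequent item-sets are then exchanged along
 * each column so that every processor sees the full F_k. */
void hd_pass_communication(int G, int *counts, int ncand,
                           int *freq, int nfreq,
                           int *all_freq, int *recv_counts, int *displs)
{
    int rank;
    MPI_Comm row_comm, col_comm;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int row = rank % G;     /* processors in a row share the same candidates  */
    int col = rank / G;     /* processors in a column form one "hypothetical" */
                            /* CD processor and share a transaction partition */

    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);

    /* Step 2: reduction along the rows gives global candidate counts */
    MPI_Allreduce(MPI_IN_PLACE, counts, ncand, MPI_INT, MPI_SUM, row_comm);

    /* Step 3: all-to-all broadcast of the locally frequent item-sets
       along the columns */
    MPI_Allgatherv(freq, nfreq, MPI_INT,
                   all_freq, recv_counts, displs, MPI_INT, col_comm);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
}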
The HD algorithm determines the configuration of the processor grid dynamically. In particular, the HD algorithm partitions the candidate set into big enough sections and assigns a group of processors to each partition. Let m be a user-specified threshold. If the total number of candidates M is less than m, then the HD algorithm makes G equal to 1, which means that the CD algorithm is run on all the processors. Otherwise G is set to the ceiling of M/m. Table II shows how the HD algorithm chose the processor configuration based on the number of candidates at each pass with 64 processors and m = 50K.
The HD algorithm inherits all the good features of the IDD algorithm. It also provides a good load balance and enough computation work by maintaining a minimum number of candidates per processor. At the same time, the
amount of data movement in this algorithm has been cut down to 1/G of that of IDD.

Fig. 9. Hybrid Distribution (HD) Algorithm on a 3 x 4 Processor Mesh (G = 3, P = 12). Step 1: partitioning of the candidate sets and data movement along the columns; Step 2: reduction operation along the rows; Step 3: all-to-all broadcast operation along the columns.

TABLE II
Processor configuration and number of candidates of the HD algorithm with 64 processors and m = 50K at each pass. Note that the 64 x 1 configuration is the same as the IDD algorithm and 1 x 64 is the same as the CD algorithm. The total number of passes was 13, and all passes after pass 6 had the 1 x 64 configuration.

Pass             1       2        3        4        5        6
Configuration    8 x 8   64 x 1   4 x 16   2 x 32   2 x 32   1 x 64
No. of Cand.     351K    4348K    115K     76K      56K      34K
E. Other Parallel Algorithms

In addition to CD and DD, four parallel algorithms (NPA, SPA, HPA and HPA-ELD) for mining association rules were proposed in [11]. NPA is very similar to CD and SPA is very similar to DD. HPA and HPA-ELD both have some similarities with IDD, as all three algorithms essentially eliminate the redundant computation inherent in DD. However, the approach taken in HPA (and HPA-ELD) is quite different from that taken in IDD. In pass k of HPA, for each transaction containing I items, $C = \binom{I}{k}$ potential candidates of size k are generated. Each of these potential candidates is hashed to determine which processor might contain the candidate item-set matching it. These C potential candidates are sent only to the corresponding processors. Then each processor checks the potential candidates collected from all the processors against its locally stored subset of candidate item-sets. The distribution of the candidate item-sets over processors is determined by the hash function. This may make it difficult to ensure that each processor receives an equal number of candidates. Furthermore, the number of potential candidates of size k generated for a transaction containing I items is $O\!\left(\binom{I}{k}\right)$. Hence, for values of k greater than 2, HPA can have a much larger communication volume than DD and IDD. For small values of k (e.g., k = 2), it is possible for HPA to incur smaller communication overhead than IDD.

Several researchers have proposed parallel formulations of association rule algorithms [12], [13], [14]. Park, Chen, and Yu proposed PDM [12], a parallel formulation of the serial association rule algorithm DHP [15]. PDM is similar in nature to the CD algorithm. In [14], Zaki et al. presented a parallelization of a serial algorithm originally introduced in [16]. This serial algorithm is of an entirely different nature than Apriori, hence its parallel formulations cannot be compared to the algorithms discussed in this paper.
IV. Performance Analysis

In this section, we analyze the amount of work done by each algorithm and the scalability of each algorithm. In this analysis, a parallel algorithm is considered scalable when its efficiency can be maintained as the number of processors is increased, provided that the problem size is also increased [9]. Let $T_{serial}$ be the runtime of the serial algorithm and $T_P$ be the runtime of a parallel algorithm. The efficiency E [9] of a parallel algorithm is

$E = \frac{T_{serial}}{P \cdot T_P}.$

A parallel algorithm is scalable if $P \cdot T_P$ and $T_{serial}$ remain of the same order [9]. The problem size (i.e., the serial runtime) of the Apriori algorithm increases either by increasing N or by increasing M (as a result of lowering the minimum support) in the algorithms discussed in Section III. Table III describes the symbols used in this section.
As discussed in Section II, each iteration of the algorithm consists of two steps: (i) candidate generation and hash tree construction, and (ii) computation of the subset function for each transaction. The derivation of the runtime of the subset function is much more involved. Consider a transaction that has I items. During the k-th pass of the algorithm, this transaction has $C = \binom{I}{k}$ potential candidates that need to be checked against the candidate hash tree. Note that for a given transaction, if checking one potential candidate leads to a visit to a leaf node, then all the candidates of this transaction are checked against that leaf node. As a result, if this node is revisited due to a different candidate from the same transaction, no checking needs to be performed. Clearly the total cost of checking at the leaf nodes is directly proportional to the number of distinct leaf nodes visited with the transaction. We assume that the average number of candidate item-sets at the leaf nodes is S. Hence the average number of leaf nodes in a hash tree is L = M/S. In the implementation of the algorithm, the desired value of S can be obtained by adjusting the branching factor of the hash tree. In general, the cost of traversal for each potential candidate will depend on the depth of the leaf node reached by the traversal. To simplify the analysis, we assume that the cost of each traversal is the same. Hence, the total traversal cost is directly proportional to C. For each potential candidate, we define $t_{travers}$ to be the cost associated with the traversal of the hash tree and $t_{check}$ to be the cost associated with checking the candidate item-sets of the reached leaf node.

Note that the number of distinct leaves checked by a transaction is in general smaller than the number of potential candidates C. This is because different potential candidates may lead to the same leaf node. In general, if C is relatively large with respect to the number of leaf nodes
in the hash tree, then the number of distinct leaf nodes visited will be smaller than C. We can compute the expected number of distinct leaf nodes visited as follows. To simplify the analysis, we assume that each traversal of the hash tree due to a different potential candidate is equally likely to lead to any leaf node of the hash tree.

TABLE III
Symbols used in the analysis.

symbol       definition
N            Total number of transactions
P            Number of processors
M            Total number of candidates
G            Number of partitions of candidates in the HD algorithm
k            Pass number in the Apriori algorithm
I            Average number of items in a transaction
C            Average number of potential candidates in a transaction
S            Average number of candidates at a leaf node
L            Average number of leaves in the hash tree for the serial Apriori algorithm
t_travers    Cost of hash tree traversal per potential candidate
t_check      Cost of checking at a leaf with S candidates
V_{i,j}      Expected number of leaves visited with i potential candidates and j leaves
Let $P_v$ be the probability of reaching a previously visited leaf node and $P_n$ be the probability of reaching a new one. Then $V_{i,j}$, the expected number of distinct leaf nodes visited when the transaction has i potential candidates and the hash tree has j leaf nodes, is:

$V_{1,j} = 1$

$V_{i,j} = V_{i-1,j} \cdot P_v + (V_{i-1,j} + 1) \cdot P_n
         = V_{i-1,j} \cdot \frac{V_{i-1,j}}{j} + (V_{i-1,j} + 1) \cdot \frac{j - V_{i-1,j}}{j}
         = 1 + \frac{j-1}{j}\, V_{i-1,j}
         = \frac{1 - \left(\frac{j-1}{j}\right)^{i}}{1 - \frac{j-1}{j}}
         = \frac{j^{i} - (j-1)^{i}}{j^{\,i-1}} \qquad (1)$

Note that for large j, $V_{i,j} \simeq i$. This can be shown by taking the limit of Equation 1:

$\lim_{j \to \infty} V_{i,j} = \lim_{j \to \infty} \frac{j^{i} - (j-1)^{i}}{j^{\,i-1}} = \lim_{j \to \infty} \frac{i\, j^{\,i-1} - O(j^{\,i-2})}{j^{\,i-1}} = i \qquad (2)$

This shows that if the hash tree size is much larger than the number of potential candidates in a transaction, then each potential candidate is likely to visit a distinct leaf node of the hash tree.
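Equation 1 and its limit are easy to check numerically. The small program below, added here only as a verification aid, evaluates the closed form V_{i,j} = j(1 - ((j-1)/j)^i) for several hash tree sizes and shows that it approaches i as j grows.

#include <math.h>
#include <stdio.h>

/* Expected number of distinct leaves visited: Equation 1 in closed form. */
static double V(double i, double j)
{
    return j * (1.0 - pow((j - 1.0) / j, i));
}

int main(void)
{
    double i = 20.0;                     /* potential candidates per transaction */
    double leaves[] = { 10.0, 100.0, 1000.0, 100000.0 };

    for (int n = 0; n < 4; n++)
        printf("j = %8.0f   V_{i,j} = %7.3f   (i = %.0f)\n",
               leaves[n], V(i, leaves[n]), i);
    /* As j grows, V_{i,j} approaches i, confirming Equation 2. */
    return 0;
}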

Serial Apriori algorithm. Recall that in the serial Apriori algorithm, the average number of leaf nodes in the hash tree is L = M/S. Hence the number of distinct leaves visited per transaction is $V_{C,L}$, and the computation time per transaction for visiting the hash tree is:

$T_{trans} = C \cdot t_{travers} + V_{C,L} \cdot t_{check}$

So the run time of the serial algorithm for processing N transactions is:

$T_{comp}^{serial} = \underbrace{N \cdot T_{trans}}_{\text{subset function}} + \underbrace{O(M)}_{\text{hash tree construction}}
                   = N \cdot C \cdot t_{travers} + N \cdot V_{C,L} \cdot t_{check} + O(M) \qquad (3)$
The CD algorithm. In the CD algorithm the entire set of candidates is replicated on each processor. Hence the average number of leaf nodes in the local hash tree at each processor is L = M/S, the same as in the serial Apriori algorithm. Thus the CD algorithm performs the same computation per transaction as the serial algorithm, but each processor handles only N/P transactions. Hence the run time of the CD algorithm is:

$T_{comp}^{CD} = \underbrace{\frac{N}{P} \cdot T_{trans}}_{\text{subset function}} + \underbrace{O(M)}_{\text{hash tree construction}} + \underbrace{O(M)}_{\text{global reduction}}
              = \frac{N}{P} \cdot C \cdot t_{travers} + \frac{N}{P} \cdot V_{C,L} \cdot t_{check} + O(M) \qquad (4)$

Comparing Equation 4 to Equation 3, we see that CD performs no redundant computation. In particular, both the time for traversal and the time for checking scale down by a factor of P.

However, the cost of hash tree construction is the same as in the serial algorithm, and CD has the additional cost of the global
reduction. Hence, $P \cdot T_{comp}^{CD}$ grows as O(PM) with respect to M, whereas $T_{comp}^{serial}$ grows only as O(M). This shows that CD does not scale with respect to increasing M. If M is too large to fit in the main memory, then the set of transactions needs to be read from the disk $M/M_{capacity}$ times, adding another $O\!\left(\frac{N}{P} \cdot \frac{M}{M_{capacity}}\right)$ term to the runtime. On some architectures this can be significant, but in our discussion in the rest of the paper we will ignore this term.
The DD algorithm. In the DD algorithm, the number of candidates per processor is M/P, as the candidate set is partitioned. Hence the average number of leaf nodes in the local hash tree of each processor is L/P. Therefore, the number of distinct leaf nodes visited per transaction is $V_{C,\,L/P}$, and the computation time per transaction is:

$T_{trans}^{DD} = C \cdot t_{travers} + V_{C,\,L/P} \cdot t_{check}$

The number of transactions processed by each processor is N, as the transactions are shifted around the processors. Hence, the computation per processor of the DD algorithm is:

$T_{comp}^{DD} = N \cdot T_{trans}^{DD} + \underbrace{O(M/P)}_{\text{hash tree construction}} + \underbrace{O(N)}_{\text{data movement}}
              = N \cdot C \cdot t_{travers} + N \cdot V_{C,\,L/P} \cdot t_{check} + O(M/P) + O(N) \qquad (5)$

Comparing Equation 5 with the serial complexity (Equation 3), we see that the DD algorithm does not reduce the computation associated with the hash tree traversal. For both the serial Apriori and the DD algorithm, this cost is $N \cdot C \cdot t_{travers}$. However, the DD algorithm is able to reduce the cost associated with the checking at the leaf nodes. In particular, it reduces the serial cost of $N \cdot V_{C,L} \cdot t_{check}$ down to $N \cdot V_{C,\,L/P} \cdot t_{check}$. However, because $V_{C,\,L/P} > V_{C,L}/P$, the reduction achieved in this part is less than a factor of P. We can easily see this if we consider the case when L is very large. In this case, $V_{C,\,L/P} \simeq C$ and $V_{C,L}/P \simeq C/P$ by Equation 2. Thus, the number of leaf nodes checked over all the processors by the DD algorithm is higher than that of the serial algorithm. This is why the DD algorithm performs redundant computation. Furthermore, DD has an extra cost of data movement. Due to these two factors, DD does not scale with respect to increasing N. However, the cost of building the hash tree scales down by a factor of P. Thus, DD is scalable with respect to increasing M.

The IDD algorithm. In the IDD algorithm, just like in the DD algorithm, the average number of leaf nodes in the local hash tree of each processor is L/P. However, the average number of potential candidates that need to be checked for each transaction at each processor is much smaller than in DD, because of the intelligent partitioning of the candidate set and the use of the bit-map to prune at the root of the hash tree. More precisely, the number of potential candidates that need to be checked for a transaction is roughly C/P, assuming that we have a well-balanced partition. So the computation per transaction is:

$T_{trans}^{IDD} = \frac{C}{P} \cdot t_{travers} + V_{C/P,\,L/P} \cdot t_{check}$

Thus the computation per processor is:

$T_{comp}^{IDD} = N \cdot T_{trans}^{IDD} + \underbrace{O(M/P)}_{\text{hash tree construction}} + \underbrace{O(N)}_{\text{data movement}}
               = \frac{N \cdot C}{P} \cdot t_{travers} + N \cdot V_{C/P,\,L/P} \cdot t_{check} + O(M/P) + O(N) \qquad (6)$

Comparing Equation 6 to Equation 3, we see that the IDD algorithm is successful in reducing the cost associated with the hash tree traversal linearly. It also reduces the checking cost from $N \cdot V_{C,L} \cdot t_{check}$ down to $N \cdot V_{C/P,\,L/P} \cdot t_{check}$. Note that for sufficiently large L, $V_{C,L} \simeq C$ and $V_{C/P,\,L/P} \simeq C/P$. This shows that IDD is also able to linearly reduce the cost of checking at the leaf nodes, and thus, unlike DD, it performs no redundant work. A comparison of DD and IDD in terms of the average number of distinct leaf nodes visited per transaction is reported in our experiments (see Figure 11 and the discussion in Section V). However, P must be relatively small for IDD to have a good load balance. If P becomes large while M is fixed, the load imbalance problem discussed in Section III makes some processors work on more than 1/P of the items in a transaction at the root of the hash tree.

If the parallel architecture has hardware support for communication and computation to proceed concurrently, and the amount of computation in the subset function is significant, the data movement cost in IDD can be made negligible. In the absence of such hardware support, the cost of data movement in IDD is O(N). Thus IDD is not scalable with respect to N, but it scales better than DD, as IDD does not perform redundant computation. Like DD, IDD is scalable with respect to increasing M.

The HD algorithm. In the HD algorithm, the number of potential candidates per transaction is C/G and the number of candidates per processor is M/G. So the computation time per transaction is:

$T_{trans}^{HD} = \frac{C}{G} \cdot t_{travers} + V_{C/G,\,L/G} \cdot t_{check}$

The total number of transactions each processor has to process is $G \cdot N / P$. Thus the computation per processor is:

$T_{comp}^{HD} = \frac{G \cdot N}{P} \cdot T_{trans}^{HD} + \underbrace{O(M/G)}_{\text{hash tree construction}} + \underbrace{O(G \cdot N / P)}_{\text{data movement}} + \underbrace{O(M/G)}_{\text{global reduction}}
              = \frac{G \cdot N}{P} \cdot \frac{C}{G} \cdot t_{travers} + \frac{G \cdot N}{P} \cdot V_{C/G,\,L/G} \cdot t_{check} + O(M/G) + O(G \cdot N / P) \qquad (7)$

Compared to the serial algorithm, Equation 7 shows that the HD algorithm reduces the computation linearly with respect to the hash tree traversal cost. The traversal cost is reduced from N · C · t_travers down to N · C · t_travers / P. The cost of checking at the leaf nodes is reduced from N · V_{C,L} · t_check down to (G · N · V_{C/G, L/G} · t_check) / P. Note that for sufficiently large L, N · V_{C,L} ≈ N · C and G · N · V_{C/G, L/G} / P ≈ N · C / P. Thus, the HD algorithm has a linear speedup with respect to the cost of checking at the leaf nodes.

HD also has a data movement cost. However, when P is increased with increasing N, this cost remains almost constant provided G is unchanged. Thus HD is scalable with respect to increasing N. Furthermore, HD scales with increasing M provided G is chosen such that M/G is constant.

We now compare HD and CD using Equations 4 and 7. Equation 4 can be roughly summarized as O(N/P) + O(M), and Equation 7 can be similarly summarized as O(G · N/P) + O(M/G). The run time of HD is less than that of CD when

    O(G · N/P) + O(M/G) < O(N/P) + O(M)

Solving for G, which is the number of candidate partitions in HD, gives the following:

    1 < G < O(M · P / N)                                                             (8)

Equation 8 shows that when M is relatively large compared to N, HD can outperform CD for a wide range of G values. It also shows that as N becomes relatively larger than M, HD can reduce G to retain a performance advantage over CD. When N is very large compared to M · P, HD can choose G to be 1, in which case it becomes exactly the same as CD.
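To make the trade-off captured by Equation 8 concrete, the following C sketch evaluates the summarized cost models O(N/P) + O(M) for CD and O(G · N/P) + O(M/G) for HD and picks a good value of G. The unit costs, parameter values, and the restriction of G to divisors of P (suggested by the G × P/G processor configurations used in the experiments) are illustrative assumptions of ours, not code or measurements from this paper.

/* Sketch: compare the summarized cost models of CD and HD
 * (Equations 4 and 7 reduced to O(N/P) + O(M) and O(G*N/P) + O(M/G))
 * and pick the best G among powers of two dividing P.
 * The unit costs c_trans and c_cand are assumed, not measured. */
#include <stdio.h>

static double cd_cost(double n, double m, double p) {
    const double c_trans = 1.0, c_cand = 1.0;    /* assumed unit costs */
    return c_trans * n / p + c_cand * m;         /* O(N/P) + O(M)      */
}

static double hd_cost(double n, double m, double p, double g) {
    const double c_trans = 1.0, c_cand = 1.0;
    return c_trans * g * n / p + c_cand * m / g; /* O(G*N/P) + O(M/G)  */
}

int main(void) {
    double n = 1.3e6, m = 0.7e6;   /* transactions and candidates (illustrative) */
    double p = 64.0;               /* processors */

    double best_g = 1.0, best = hd_cost(n, m, p, 1.0);
    for (double g = 2.0; g <= p; g *= 2.0) {
        double c = hd_cost(n, m, p, g);
        if (c < best) { best = c; best_g = g; }
    }
    printf("CD cost model : %.0f\n", cd_cost(n, m, p));
    printf("HD cost model : %.0f with G = %.0f\n", best, best_g);
    printf("Equation 8 bound: 1 < G < M*P/N = %.1f\n", m * p / n);
    return 0;
}

Running the sketch with M much larger than N/P yields a G well above 1, mirroring the regime in which HD is expected to outperform CD.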
V. Experimental Results

We implemented our parallel algorithms on a 128-processor Cray T3E and an IBM SP2 parallel computer. Each processor on the T3E is a 600 MHz DEC Alpha (EV5) and has 512 Mbytes of memory. The processors are interconnected via a three-dimensional torus network that has a peak unidirectional bandwidth of 430 Mbytes/second and a small latency. For communication we used the Message Passing Interface (MPI). Our experiments have shown that for 16 Kbyte messages we obtain a bandwidth of 303 Mbytes/second and an effective startup time of 16 microseconds. SP2 nodes consist of a Power2 processor clocked at 66.7 MHz with 128 Kbytes of data cache, 32 Kbytes of instruction cache, a 256-bit memory bus, 256 Mbytes of real memory, and 1 Gbyte of virtual memory. The SP High Performance Switch (HPS) has a theoretical maximum bandwidth of 110 Mbytes/second.

Fig. 10. Scaleup result on Cray T3E with 50K transactions and 0.1% minimum support (response time in seconds vs. number of processors for CD, IDD, HD, DD, and DD+comm).
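The bandwidth and startup-time figures quoted above are the kind of numbers a simple MPI ping-pong test produces. The sketch below is a minimal version of such a test (message size, repetition count, and output format are our own choices, not the authors' measurement harness); it must be run with at least two MPI processes.

/* Minimal MPI ping-pong sketch for estimating point-to-point
 * bandwidth and effective startup time between ranks 0 and 1. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    const int bytes = 16 * 1024;   /* 16 Kbyte messages, as in the text */
    const int reps  = 1000;
    char *buf = malloc(bytes);
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t = MPI_Wtime() - t0;

    if (rank == 0) {
        double one_way = t / (2.0 * reps);   /* average one-way message time */
        printf("one-way time: %.2f us, bandwidth: %.1f Mbytes/s\n",
               one_way * 1e6, bytes / one_way / 1e6);
    }
    free(buf);
    MPI_Finalize();
    return 0;
}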
We generated a synthetic dataset using a tool provided by [17] and described in [2]. The parameters chosen for the data set are an average transaction length of 15 and an average frequent itemset size of 6. Data sets with 1000 transactions (63 Kbytes) were generated for different processors. Due to the absence of a true parallel I/O system on the T3E, we kept a set of transactions in a main memory buffer and read the transactions from the buffer instead of from the actual disks. For the experiments involving larger data sets, we read the same data set multiple times. We also performed similar experiments on an IBM SP2 in which the entire database resided on disks. Our experiments (not reported here) show that the I/O requirements do not change the relative performance of the various schemes. We do present the results of one experiment on a 16-processor SP2 comparing CD to IDD and HD when CD scans the database multiple times due to the partitioned hash tree.
To compare the scalability of the four schemes (CD, DD, IDD, and HD), we performed scaleup tests with 50K transactions per processor and a minimum support of 0.1%. With a minimum support of 0.1%, the entire candidate hash tree fit in the main memory of one T3E processor. For this experiment, in the HD algorithm we set the threshold on the number of candidates for switching to the CD algorithm to 5K. With 0.1% support, the HD algorithm switched to the CD algorithm in pass 5 of 12 total passes, and 88.4% of the overall response time of the serial code was spent in the first 4 passes. These scaleup results are shown in Figure 10.
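The switch from HD's grouped execution to pure CD described above amounts to a per-pass decision driven by the candidate count. The C sketch below illustrates one way such a decision could look; the function name, the memory-capacity heuristic for growing G, and the toy per-pass candidate counts are our own assumptions, not the authors' implementation.

/* Schematic per-pass choice of the candidate-partition count G in HD.
 * Below the switch threshold the whole candidate set is replicated
 * (G = 1, i.e., plain CD); otherwise G is grown until each group's
 * share of the candidates fits in local memory.  Threshold, capacity,
 * and the divisor-based search are illustrative assumptions. */
#include <stdio.h>

static int choose_groups(long num_candidates, int num_procs,
                         long switch_threshold, long mem_capacity) {
    if (num_candidates <= switch_threshold)
        return 1;                          /* few candidates: run as CD      */
    for (int g = 2; g <= num_procs; g++)   /* otherwise partition further    */
        if (num_procs % g == 0 && num_candidates / g <= mem_capacity)
            return g;
    return num_procs;                      /* fall back to a full IDD-style split */
}

int main(void) {
    long candidates_per_pass[] = { 800000, 250000, 60000, 4000, 900 }; /* toy values for passes 2-6 */
    for (int i = 0; i < 5; i++) {
        int g = choose_groups(candidates_per_pass[i], 64, 5000, 200000);
        printf("pass %d: %ld candidates -> G = %d (%s)\n",
               i + 2, candidates_per_pass[i], g, g == 1 ? "CD" : "HD");
    }
    return 0;
}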
As noted in [6], the DD algorithm scales very poorly. However, the performance achieved by IDD is much better than that of the DD algorithm. In particular, on 32 processors, IDD is faster than DD by a factor of 5.6. It can be seen that the performance gap between IDD and DD widens as the number of processors increases. IDD performs better than DD because of the better communication mechanism for data movements and the intelligent partitioning of the candidate set. To show the effects of these two improvements, we replaced the communication mechanism of the DD algorithm with that of IDD. The scaleup result of this improvement is shown as "DD+comm" in Figure 10. Hence the response time reduction from DD to DD+comm is due to the better communication mechanism for data movements, and the reduction from DD+comm to IDD is due to the intelligent partitioning of the candidate set. The same experiment comparing DD, DD+comm, and IDD on the IBM SP2 showed a similar pattern. We also show the effect of IDD's intelligent partitioning over DD by actually counting the number of distinct leaf nodes visited by both algorithms. We want to verify that the average number of distinct leaf nodes visited by IDD is indeed much smaller than that of DD. Figure 11 shows that V_{C/P, L/P} of IDD goes down by a factor of P, but V_{C, L/P} of DD does not go down by a factor of P.

Fig. 11. Comparison of DD and IDD in terms of the average number of distinct leaf nodes visited per transaction, with 50K transactions per processor and 0.2% minimum support (average number of distinct leaf nodes visited per transaction vs. number of processors for DD and IDD).

Fig. 12. Response time on a 16-processor IBM SP2 with 100K transactions as the minimum support varies from 0.1% to 0.025% (response time in seconds vs. number of candidates in millions for HD, CD, and IDD).
Note that the response time of IDD increases as we increase the number of processors. This is due to the load
balancing problem discussed in Section III, where the number of candidates per processor decreases as the number of
processors increases. Looking at the performance of the HD
algorithm, we see that the response time remains almost
constant as we increase the number of processors while
keeping the number of transactions per processor and the
minimum support fixed. Comparing against CD, we see that HD actually performs better as the number of processors increases. Its performance on 128 processors is 16.5% better than that of CD. This performance advantage of HD over CD is due to the smaller cost of building the candidate hash tree and of the global reduction in HD.
In the previous experiment, we chose the minimum support high enough that the entire candidate hash tree fits in main memory. When the candidate hash tree does not fit in main memory, CD partitions it such that each partition fits in main memory. The entire set of local transactions then has to be read at each processor as many times as there are partitions. This method increases the I/O cost. On a system in which I/O is scalable and fast (e.g., the IBM SP2), this cost may be acceptable. We implemented the CD algorithm to partition the hash tree and read the database multiple times in case the hash tree does not fit into main memory. Figure 12 shows the performance comparison of CD, IDD, and HD on a 16-processor IBM SP2 as the number of candidates increases by lowering the minimum support. Unlike the earlier experiments on the Cray T3E, all transactions were read from the file. Figure 12 shows that as the number of candidates increases, both IDD and HD outperform CD. This is due to CD's cost of building the candidate hash tree, the increased I/O time required for multiple scans of the database, and the increased communication time required for the global reduction operation over multiple partitions of the candidate frequencies. Note that even on the IBM SP2, the penalty due to these overheads is about 8% for 1 million candidates, 11% for 3 million candidates, and 25% for 11 million candidates. For this particular experiment, the overhead of building the hash tree was the dominant cost. However, on systems with slower I/O, the I/O penalty can be substantial in addition to the overhead of building the hash tree.
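The multiple database scans described here follow a simple pattern: the candidate set is split into as many partitions as are needed to fit in memory, and the local transactions are re-read once per partition. The sketch below shows that loop structure; a flat array of candidate item pairs stands in for the candidate hash tree, and the data are toy values of ours, so it is only an illustration of the scanning pattern, not the paper's implementation.

/* Sketch of CD's out-of-core counting loop: when the candidate set does
 * not fit in memory, it is split into partitions and the transactions
 * are scanned once per partition. */
#include <stdio.h>

#define NUM_CAND   6
#define MEM_CAP    2          /* candidates that "fit in memory" at once */
#define NUM_TRANS  3
#define TRANS_LEN  4

static const int cand[NUM_CAND][2] = { {1,2},{1,3},{2,3},{2,4},{3,4},{1,4} };
static const int db[NUM_TRANS][TRANS_LEN] = { {1,2,3,4},{1,3,4,0},{2,3,0,0} }; /* 0 = padding */

static int contains(const int *t, int item) {
    for (int i = 0; i < TRANS_LEN; i++)
        if (t[i] == item) return 1;
    return 0;
}

int main(void) {
    int support[NUM_CAND] = {0};
    int num_parts = (NUM_CAND + MEM_CAP - 1) / MEM_CAP;

    for (int part = 0; part < num_parts; part++) {   /* one database scan per partition */
        int lo = part * MEM_CAP;
        int hi = lo + MEM_CAP < NUM_CAND ? lo + MEM_CAP : NUM_CAND;
        for (int t = 0; t < NUM_TRANS; t++)          /* re-read the transactions */
            for (int c = lo; c < hi; c++)
                if (contains(db[t], cand[c][0]) && contains(db[t], cand[c][1]))
                    support[c]++;
    }
    for (int c = 0; c < NUM_CAND; c++)
        printf("{%d,%d}: %d\n", cand[c][0], cand[c][1], support[c]);
    return 0;
}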
In order to study the scalability of these algorithms, we performed experiments on the T3E with varying numbers of processors (P), candidates (M), and transactions (N). For these experiments, we measured the performance of computing size-3 frequent itemsets only, as the computation for size-3 itemsets took more than 55% of the total run time.

Figure 13 shows the speedup of the three algorithms as P is increased from 4 to 64 with N = 1.3 million and M = 0.7 million. Note that the whole candidate hash tree fit in main memory, and thus the CD algorithm read the transactions only once. The figure clearly shows that the HD algorithm achieves better speedup than CD and IDD, and the difference in performance increases for larger numbers of processors. The reason for CD's poor speedup is the serial bottleneck of hash tree construction and the global reduction operation. For 4 processors, the time taken for hash tree construction is only 3.1% of the total runtime and the time for global reduction is only 1.6% of the total runtime. However, for 64 processors, these overheads are 24.8% and 31.0%, respectively. On the other hand, IDD has poor speedup due to load imbalance and the data movement cost. For this particular experiment, the dominant overhead is load imbalance. In particular, for 4 processors the load imbalance overhead is only 6.3%, whereas for 64 processors this overhead is 49.6%. The cost of data movement is 1.0% for 4 processors and 6.4% for 64 processors. The processor configuration chosen for HD was 8 × 8 for 64 processors. Hence, HD performed one eighth of CD's reduction operation and moved only one eighth of the data, among groups of 8 processors only.

Fig. 13. Speedup of three algorithms on Cray T3E as P is increased from 4 to 64 with N = 1.3 million and M = 0.7 million. The processor configurations for HD were 8 × 2 for 16 processors, 8 × 4 for 32 processors, and 8 × 8 for 64 processors (speedup vs. number of processors for CD, IDD, and HD).

In the next experiment, we fixed P and M, and varied N from 1.3 million to 26.1 million. Figure 14 shows the runtime of this experiment. The figure shows that CD and HD scale nicely with the increasing number of transactions. However, with fixed M and P, IDD suffers from the load imbalance problem. In addition, the cost of data movement adds up as N is increased. However, this data movement cost is only 6.1% of the total runtime for 1.3 million transactions and 7.1% for 26.1 million transactions. Hence, the majority of the runtime difference between IDD and the other two algorithms is due to the load imbalance.

Fig. 14. Runtime of three algorithms on Cray T3E as N is increased from 1.3 million to 26.1 million with M = 0.7 million and P = 64. The processor configuration for HD was 8 × 8 (response time in seconds vs. number of transactions in millions for CD, IDD, and HD).
The final experiment compares the runtime of the three algorithms as M is increased from 0.7 million to 8.0 million with fixed N and P. The main memory of the T3E was large enough to hold 0.7 million candidates. In CD, for candidate set sizes greater than 0.7 million, the candidate set is partitioned and the subset function is repeatedly called on the partitioned candidate sets. Figure 15 shows the runtime of this experiment. The figure shows that the performance gap between CD and HD widens as the number of candidates increases. This is because CD has an O(M) component in its runtime. HD scales with respect to M, as its corresponding component is O(M/G), which can be kept constant, and becomes O(M/P) as M grows much larger. For smaller values of M, IDD performs worse than CD. As M increases, the performance of IDD improves and eventually outperforms CD. This is because IDD has an O(M/P) component in its runtime compared to the O(M) of CD. Note that the HD algorithm behaves exactly the same as IDD for candidate set sizes of 3.3 million and more. This experiment shows that when M is much larger than N, IDD and HD are much better algorithms than CD.

Fig. 15. Runtime of three algorithms on Cray T3E as M is increased from 0.7 million to 8.0 million with N = 1.3 million and P = 64. The processor configurations for HD were 8 × 8 for M = 0.7 million, 16 × 4 for M = 1.7 million, 32 × 2 for M = 2.3 million, and 64 × 1 for M ≥ 3.3 million (response time in seconds vs. number of candidates in millions for CD, IDD, and HD).

For these experiments, just like the previous experiments on the T3E, we simulated I/O and assumed that the I/O cost is negligible compared to the computation cost. Even though the CD algorithm repeatedly reads transactions, no actual I/O was performed. However, when the I/O cost is factored in, the performance of CD would be worse than reported in these experiments.
VI. Conclusion

In this paper, we proposed two new parallel algorithms


for mining association rules. The IDD algorithm effectively parallelizes the step of building the hash tree, and is thus scalable with respect to increasing candidate set size. This algorithm also utilizes the total available main memory more effectively than the CD algorithm. This is important if the I/O cost becomes dominant due to a slow I/O system. The IDD algorithm improves over the DD algorithm, which has high communication overhead and redundant work. As shown in Section IV, for each transaction the DD algorithm performs substantially more work overall than the serial Apriori algorithm. The communication and idling overheads were reduced using a better data movement communication mechanism, and redundant work was reduced by partitioning the candidate set intelligently and using bit maps to prune away unnecessary computation. Another useful feature of IDD is that it is well suited for system environments with a single source of data. For instance, when all the data is coming from a database server or a single file system, one processor can read data from the single source and pass the data along the communication pipeline defined in the algorithm. However, as the number of available processors increases, the efficiency of this algorithm decreases due to load imbalance. Furthermore, IDD also suffers from an O(N) cost due to the communication of transactions, and hence is unscalable with respect to the number of transactions.
HD combines the advantages of CD and IDD. It is an
improvement over CD, as it partitions the hash tree and
thus avoids the O(M) cost of hash tree construction and global
reduction. At the same time, it is an improvement over
IDD, as it does not move data among all the processors, but
only among a smaller subset of processors. Furthermore,
HD achieves better load balancing than IDD, because the
candidate set is partitioned into fewer buckets.
The experimental results on a 128-processor Cray T3E
parallel machine show that the HD algorithm scales just
as well as the CD algorithm with respect to the number
of transactions, and scales as well as IDD with respect
to increasing candidate set size. However, it outperforms
CD when the number of candidate itemsets is large, and
outperforms IDD when the number of transactions is very
large.
Acknowledgments

This work was supported by NSF grant ASC-9634719,


Army Research Office contract DA/DAAH04-95-1-0538, a Cray Research Inc. Fellowship, and an IBM partnership award, the content of which does not necessarily reflect the policy of the government, and no official endorsement
should be inferred. Access to computing facilities was provided by AHPCRC, Minnesota Supercomputer Institute,
Cray Research Inc., and NSF grant CDA-9414015.
References
[1] M. Stonebraker, R. Agrawal, U. Dayal, E. J. Neuhold, and A. Reuter, "DBMS research at a crossroads: The Vienna update," in Proc. of the 19th VLDB Conference, Dublin, Ireland, 1993, pp. 688-692.
[2] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. of the 20th VLDB Conference, Santiago, Chile, 1994, pp. 487-499.
[3] M. A. W. Houtsma and A. N. Swami, "Set-oriented mining for association rules in relational databases," in Proc. of the 11th Int'l Conf. on Data Eng., Taipei, Taiwan, 1995, pp. 25-33.
[4] A. Savasere, E. Omiecinski, and S. Navathe, "An efficient algorithm for mining association rules in large databases," in Proc. of the 21st VLDB Conference, Zurich, Switzerland, 1995, pp. 432-443.
[5] R. Srikant and R. Agrawal, "Mining generalized association rules," in Proc. of the 21st VLDB Conference, Zurich, Switzerland, 1995, pp. 407-419.
[6] R. Agrawal and J. C. Shafer, "Parallel mining of association rules," IEEE Transactions on Knowledge and Data Eng., vol. 8, no. 6, pp. 962-969, December 1996.
[7] E. H. Han, G. Karypis, and V. Kumar, "Scalable parallel data mining for association rules," in Proc. of 1997 ACM-SIGMOD Int. Conf. on Management of Data, Tucson, Arizona, 1997.
[8] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in Proc. of 1993 ACM-SIGMOD Int. Conf. on Management of Data, Washington, D.C., 1993.
[9] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis, Introduction to Parallel Computing: Algorithm Design and Analysis, Benjamin Cummings/Addison Wesley, Redwood City, 1994.
[10] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall, Englewood Cliffs, NJ, 1982.
[11] Takahiko Shintani and Masaru Kitsuregawa, "Hash based parallel algorithms for mining association rules," in Proc. of the Conference on Parallel and Distributed Information Systems, 1996.
[12] J. S. Park, M. S. Chen, and P. S. Yu, "Efficient parallel data mining for association rules," in Proceedings of the 4th Int'l Conf. on Information and Knowledge Management, 1995.
[13] D. Cheung, V. Ng, A. Fu, and Y. Fu, "Efficient mining of association rules in distributed databases," IEEE Transactions on Knowledge and Data Eng., vol. 8, no. 6, pp. 911-922, 1996.
[14] Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, and Wei Li, "New parallel algorithms for fast discovery of association rules," Data Mining and Knowledge Discovery: An International Journal, vol. 1, no. 4, 1997.
[15] J. S. Park, M. S. Chen, and P. S. Yu, "An effective hash-based algorithm for mining association rules," in Proc. of 1995 ACM-SIGMOD Int. Conf. on Management of Data, 1995.
[16] Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, and Wei Li, "New algorithms for fast discovery of association rules," in Proc. of the Third Int'l Conference on Knowledge Discovery and Data Mining, 1997.
[17] IBM Quest Data Mining Project, "Quest synthetic data generation code," http://www.almaden.ibm.com/cs/quest/syndata.html, 1996.
