Assoc Parallel

George Karypis    Vipin Kumar
Abstract
One of the important problems in data mining is discovering association rules from databases of transactions where each transaction consists of a set of items. The most time-consuming operation in this discovery process is the computation of the frequency of occurrence of interesting subsets of items (called candidates) in the database of transactions. To prune the exponentially large space of candidates, most existing algorithms consider only those candidates that have a user-defined minimum support. Even with this pruning, the task of finding all association rules requires a lot of computation power and time. Parallel computers offer a potential solution to the computation requirement of this task, provided efficient and scalable parallel algorithms can be designed. In this paper, we present two new parallel algorithms for mining association rules. The Intelligent Data Distribution algorithm efficiently uses the aggregate memory of the parallel computer by employing an intelligent candidate partitioning scheme and uses an efficient communication mechanism to move data among the processors. The Hybrid Distribution algorithm further improves upon the Intelligent Data Distribution algorithm by dynamically partitioning the candidate set to maintain good load balance. The experimental results on a Cray T3D parallel computer show that the Hybrid Distribution algorithm scales linearly, exploits the aggregate memory better, and can generate more association rules with a single scan of the database per pass.
1 Introduction
One of the important problems in data mining [SAD+93] is discovering association rules from databases of transactions, where each transaction consists of a set of items.
This work was supported by NSF grant ASC-9634719, Army Research Office contract DA/DAAH04-95-1-0538, a Cray Research Inc. Fellowship, and an IBM partnership award, the content of which does not necessarily reflect the policy of the government, and no official endorsement should be inferred. Access to computing facilities was provided by AHPCRC, Minnesota Supercomputer Institute, Cray Research Inc., and NSF grant CDA-9414015. See http://www.cs.umn.edu/han/papers.html#DataMiningPapers for an extended version of this paper and other related papers.
2 Basic Concepts
Let T be the set of transactions where each transaction is a subset of the item-set I. Let C be a subset of I; then we define the support count of C with respect to T to be:

σ(C) = |{t | t ∈ T, C ⊆ t}|.

An association rule is an expression of the form X ⟹ Y, where X ⊆ I and Y ⊆ I. The support s of the rule X ⟹ Y is defined as σ(X ∪ Y)/|T|, and the confidence α is defined as σ(X ∪ Y)/σ(X). For example, consider a rule {1 2} ⟹ {3}, i.e., items 1 and 2 imply item 3. The support of this rule is the frequency of the item-set {1 2 3} in the transactions. For example, a support of 0.05 means that 5% of the transactions contain {1 2 3}. The confidence of this rule is defined as the ratio of the frequencies of {1 2 3} and {1 2}. For example, if 10% of the transactions contain {1 2}, then the confidence of the rule is 0.05/0.10 = 0.5. A rule that has a very high confidence (i.e., close to 1.0) is often very important, because it provides an accurate prediction on the association of the items in the rule. The support of a rule is also important, since it indicates how frequent the rule is in the transactions. Rules that have very small support are often uninteresting, since they do not describe significantly large populations. This is one of the reasons why most algorithms disregard any rules that do not satisfy the minimum support condition specified by the user. This filtering due to the minimum required support is also critical in reducing the number of derived association rules to a manageable size.
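As a small illustration of these definitions (not code from the paper), the following C program computes the support and confidence of the example rule {1 2} ⟹ {3} over a toy transaction database; transactions and item-sets are encoded as bitmasks so that the containment test C ⊆ t becomes a single mask comparison.

/* Illustrative only: bitmask encoding of transactions and item-sets. */
#include <stdio.h>

#define ITEM(i) (1u << ((i) - 1))     /* bit i-1 encodes item i */

/* support count sigma(C): number of transactions containing C */
int support(unsigned C, const unsigned *T, int n)
{
    int cnt = 0;
    for (int i = 0; i < n; i++)
        if ((T[i] & C) == C) cnt++;
    return cnt;
}

int main(void)
{
    unsigned T[] = { ITEM(1)|ITEM(2)|ITEM(3), ITEM(1)|ITEM(2),
                     ITEM(2)|ITEM(3),         ITEM(1)|ITEM(2)|ITEM(3)|ITEM(4) };
    int n = 4;
    unsigned X = ITEM(1)|ITEM(2), Y = ITEM(3);

    double s    = (double)support(X | Y, T, n) / n;              /* support of {1 2} => {3} */
    double conf = (double)support(X | Y, T, n) / support(X, T, n);
    printf("support = %.2f  confidence = %.2f\n", s, conf);      /* prints 0.50 and 0.67 */
    return 0;
}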
The task of discovering association rules is to find all rules X ⟹ Y whose support s is at least a given minimum support threshold and whose confidence α is at least a given minimum confidence threshold. The association rule discovery is composed of two steps. The first step is to discover all the frequent item-sets (candidate item-sets that have support greater than the specified minimum support threshold), and the second step is to generate, from these frequent item-sets, association rules that have confidence higher than the minimum confidence threshold.
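The second step can be sketched in the same bitmask style. The helper gen_rules below is purely illustrative (it is not part of the Apriori code): for a frequent item-set Z it enumerates every rule X ⟹ Z∖X and keeps those whose confidence σ(Z)/σ(X) meets the minimum confidence threshold.

/* Illustrative sketch of rule generation from one frequent item-set Z. */
#include <stdio.h>

static int support(unsigned C, const unsigned *T, int n)
{
    int cnt = 0;
    for (int i = 0; i < n; i++)
        if ((T[i] & C) == C) cnt++;
    return cnt;
}

static void gen_rules(unsigned Z, const unsigned *T, int n, double minconf)
{
    int sZ = support(Z, T, n);
    /* enumerate every non-empty proper subset X of Z as a rule body */
    for (unsigned X = (Z - 1) & Z; X != 0; X = (X - 1) & Z) {
        double conf = (double)sZ / support(X, T, n);
        if (conf >= minconf)
            printf("0x%X => 0x%X  (confidence %.2f)\n", X, Z & ~X, conf);
    }
}

int main(void)
{
    unsigned T[] = { 0x7, 0x3, 0x6, 0xF };    /* {1,2,3}, {1,2}, {2,3}, {1,2,3,4} */
    gen_rules(0x7, T, 4, 0.6);                /* rules derived from the frequent set {1,2,3} */
    return 0;
}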
A number of algorithms have been developed for discovering association rules [AIS93, AS94, HS95]. Our parallel algorithms are based on the Apriori algorithm [AS94], which has smaller computational complexity compared to other algorithms. In the rest of this section, we briefly describe the Apriori algorithm. The reader should refer to [AS94] for further details.
F1 = { frequent 1-item-sets };
for ( k = 2; Fk-1 ≠ ∅; k++ ) do begin
    Ck = apriori_gen(Fk-1)
    for all transactions t ∈ T
        subset(Ck, t)
    Fk = { c ∈ Ck | c.count ≥ minsup }
end
Answer = ∪k Fk
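The following self-contained C sketch mirrors the control flow of the pseudocode above under simplifying assumptions of ours: item-sets are bitmasks over at most 32 items, apriori_gen is reduced to the join step (the subset-based prune is omitted), and the hash-tree subset() routine is replaced by a direct scan of the transactions, so it illustrates only the level-wise structure, not the paper's data structures.

/* Toy level-wise Apriori loop over bitmask item-sets (illustrative only). */
#include <stdio.h>

#define MAX_SETS 4096

static int support(unsigned set, const unsigned *T, int n)
{
    int cnt = 0;
    for (int i = 0; i < n; i++)
        if ((T[i] & set) == set) cnt++;
    return cnt;
}

int main(void)
{
    /* toy transaction database; bit i-1 of a word encodes item i */
    unsigned T[] = { 0x07, 0x06, 0x0D, 0x07, 0x0E };
    int n = 5, minsup = 2;

    unsigned F[MAX_SETS], C[MAX_SETS];   /* frequent and candidate item-sets */
    int nf = 0;

    /* F1: frequent 1-item-sets */
    for (int i = 0; i < 32; i++)
        if (support(1u << i, T, n) >= minsup)
            F[nf++] = 1u << i;

    for (int k = 2; nf > 0; k++) {
        /* apriori_gen (join only): combine pairs of frequent (k-1)-item-sets
           that yield an item-set of exactly k items */
        int nc = 0;
        for (int i = 0; i < nf; i++)
            for (int j = i + 1; j < nf; j++) {
                unsigned c = F[i] | F[j];
                if (__builtin_popcount(c) == k && nc < MAX_SETS)   /* GCC/Clang builtin */
                    C[nc++] = c;
            }

        /* count every candidate with a full scan (the paper uses a hash tree
           here) and keep the frequent ones as F for the next level */
        int next = 0;
        for (int i = 0; i < nc; i++) {
            int dup = 0;
            for (int j = 0; j < next; j++)
                if (F[j] == C[i]) dup = 1;
            if (!dup && support(C[i], T, n) >= minsup) {
                printf("frequent %d-item-set: 0x%X\n", k, C[i]);
                F[next++] = C[i];
            }
        }
        nf = next;
    }
    return 0;
}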
[Figure: Subset operation on a candidate hash tree for the transaction {1 2 3 5 6}. Items are hashed with the hash function (1,4,7 | 2,5,8 | 3,6,9) to select subtrees, splitting the transaction as 1 + {2 3 5 6}, 2 + {3 5 6}, and 3 + {5 6}, and recursively at the lower levels (e.g. 1 3 + {5 6}, 1 5 + {6}), until the leaves containing candidate item-sets are reached.]
3 Parallel Algorithms
In this section, we will focus on the parallelization of the first task, which finds all frequent item-sets. We first discuss two parallel algorithms proposed in [AS96] to help motivate our parallel formulations. In all our discussions, we assume that the transactions are evenly distributed among the processors.
3.1 Count Distribution Algorithm
In the Count Distribution (CD) algorithm proposed in [AS96], each processor computes how many times all the candidates appear in the locally stored transactions. This is done by building the entire hash tree that corresponds to all the candidates and then performing a single pass over the locally stored transactions to collect the counts. The global counts of the candidates are computed by summing these individual counts using a global reduction operation [KGGK94]. This algorithm is illustrated in Figure 4. Note that since each processor needs to build a hash tree for all the candidates, these hash trees are identical at each processor. Thus, excluding the global reduction, each processor in the CD algorithm executes the serial Apriori algorithm on the locally stored transactions.

This algorithm has been shown to scale linearly with the number of transactions [AS96]. This is because each processor can compute the counts independently of the other processors and needs to communicate with the other processors only once, at the end of the computation step. However, this algorithm works well only when the hash trees can fit into the main memory of each processor. If the number of candidates is large, then the hash tree does not fit into the main memory. In this case, this algorithm has to partition the hash tree and compute the counts by scanning the database multiple times, once for each partition of the hash tree. Note that the number of candidates increases if either the number of distinct items in the database increases or if the minimum support level of the association rules decreases. Thus the CD algorithm is effective for a small number of distinct items and a high minimum support level.
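As a concrete illustration of this scheme, the following self-contained sketch (a toy example of ours, not the authors' code) counts a few bitmask-encoded candidates over the local transactions and then obtains identical global counts on every processor with a single MPI_Allreduce, which plays the role of the global reduction described above.

/* Toy sketch of one CD pass: local counting followed by a global reduction. */
#include <mpi.h>
#include <stdio.h>

#define NCAND  3
#define NLOCAL 4

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* toy local transactions and candidate item-sets, both as bitmasks */
    unsigned local_T[NLOCAL] = { 0x7, 0x3, 0x6, 0x5 };
    unsigned cand[NCAND]     = { 0x3, 0x5, 0x6 };      /* {1,2}, {1,3}, {2,3} */

    int local_cnt[NCAND] = {0}, global_cnt[NCAND];
    for (int c = 0; c < NCAND; c++)                    /* one scan of the local data */
        for (int t = 0; t < NLOCAL; t++)
            if ((local_T[t] & cand[c]) == cand[c]) local_cnt[c]++;

    /* global reduction: sum the per-processor count of every candidate */
    MPI_Allreduce(local_cnt, global_cnt, NCAND, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        for (int c = 0; c < NCAND; c++)
            printf("candidate 0x%X: global count %d\n", cand[c], global_cnt[c]);

    MPI_Finalize();
    return 0;
}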
Figure 4: The Count Distribution (CD) algorithm. Each of the P processors stores N/P transactions, holds an identical copy of the candidate set ({A,B}, {A,C}, {A,D}, {B,C}, {B,E}, {C,D}, {D,E}) with its local counts, and the local counts are summed with a global reduction. (N: number of data items, M: size of candidate set, P: number of processors.)
[Figure: The Data Distribution (DD) approach of [AS96]. Each processor stores N/P local transactions and a disjoint subset of the candidates with their counts (e.g. {A,B}, {B,C}, {C,E} on one processor), and every processor broadcasts its local data, so each processor counts its own candidates over both local and remote transactions received through the all-to-all broadcast.]
/* Pipelined movement of transaction data in IDD: in each of the P-1 steps a
   processor processes the buffer it currently holds while asynchronously
   receiving the next buffer from its left neighbor and sending the current
   one to its right neighbor. */
while (!done) {
    FillBuffer(fd, SBuf);            /* read the next block of local transactions */
    for (k = 0; k < P - 1; ++k) {
        /* send/receive data in a non-blocking pipeline */
        MPI_Irecv(RBuf, BUFSZ, MPI_BYTE, left,  0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(SBuf, BUFSZ, MPI_BYTE, right, 0, MPI_COMM_WORLD, &req[1]);
        /* process transactions in SBuf and update the hash tree */
        Subset(HTree, SBuf);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
        /* swap the two buffers */
        tmp  = SBuf;
        SBuf = RBuf;
        RBuf = tmp;
    }
    /* process the transactions of the last buffer */
    Subset(HTree, SBuf);
}
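In this pipeline, the communication of the next data block is overlapped with the Subset() computation on the block currently held, so that after the P-1 shift steps each processor has counted its own portion of the candidates against the transactions of every processor while only ever holding two buffers in memory.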
[Figure: The Intelligent Data Distribution (IDD) algorithm. The candidate set is partitioned so that each processor builds a hash tree over M/P candidates whose first items match its bit map (e.g. the processor with bit map A,C stores {A,B}, {A,C}, {C,E}; the one with bit map D stores {D,E}, {D,F}, {D,G}; the one with bit map F,G stores {F,G}, {G,I}, {G,J}), and the local data (N/P transactions per processor) is shifted among the processors so that every processor counts its candidates against all transactions. N: number of data items, M: size of candidate set, P: number of processors.]
[Figure: The Hybrid Distribution (HD) algorithm on a G x (P/G) processor mesh. Step 1: partitioning of candidate sets and data movement along the columns — the candidate hash tree is partitioned among the rows of the mesh (e.g. the partitions containing A,B...; D,E...; F,G...), all processors of a row hold the same partition, and the transactions are shifted along each column so that every candidate partition is counted against all the data of that column.]
The counts are then summed along each row of the processor mesh, so that the processor in the first column of each row has the total counts for the candidates assigned to the processors of that row. In step 3, all the processors in the first column generate the frequent sets from their candidate sets and perform an all-to-all broadcast operation along the first column of the processor mesh. Then the processors in the first column broadcast the full frequent sets to the processors along the same row using a one-to-all broadcast operation [KGGK94]. At this point, all the processors have the frequent sets and are ready to proceed to the next pass.

This algorithm inherits all the good features of the IDD algorithm. It also provides good load balance and enough computation work by maintaining a minimum number of candidates per processor. At the same time, the amount of data movement in this algorithm has been cut down to 1/G of that of the IDD algorithm.
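The communication skeleton of these steps can be sketched with MPI communicators roughly as follows. This is our own illustrative code, not the paper's implementation: the mesh layout (G rows of P/G processors, P divisible by G), the function hd_exchange, and the use of an Allreduce over frequent-set counts as a stand-in for the all-to-all broadcast of the actual frequent sets are all assumptions made for the sketch.

/* Illustrative HD communication skeleton on a G x (P/G) processor mesh. */
#include <mpi.h>
#include <stdlib.h>

int hd_exchange(int *local_counts, int ncand, int minsup, int G)
{
    int P, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &P);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int cols = P / G;                     /* assumes P is divisible by G */
    int row = rank / cols, col = rank % cols;

    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);   /* processors of one row    */
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);   /* processors of one column */

    /* sum the counts of this row's candidates onto the row's first column */
    int *total = calloc(ncand, sizeof(int));
    MPI_Reduce(local_counts, total, ncand, MPI_INT, MPI_SUM, 0, row_comm);

    /* first-column processors extract their frequent sets and exchange them
       along the first column (here only the number of frequent sets) */
    int nfreq = 0, global_nfreq = 0;
    if (col == 0) {
        for (int i = 0; i < ncand; i++)
            if (total[i] >= minsup) nfreq++;
        MPI_Allreduce(&nfreq, &global_nfreq, 1, MPI_INT, MPI_SUM, col_comm);
    }

    /* broadcast the result from the first column to the rest of each row */
    MPI_Bcast(&global_nfreq, 1, MPI_INT, 0, row_comm);

    free(total);
    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    return global_nfreq;    /* every processor now knows the total frequent-set count */
}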
4 Experimental Results
We implemented our parallel algorithms on a 128-processor Cray T3D parallel computer. Each processor on the T3D is a 150 MHz DEC Alpha (EV4) and has 64 Mbytes of memory. The processors are interconnected via a three-dimensional torus network that has a peak unidirectional bandwidth of 150 Mbytes per second and a small latency. For communication we used the Message Passing Interface (MPI). Our experiments have shown that for 16-Kbyte messages we obtain a bandwidth of 74 Mbytes/second and an effective startup time of 150 microseconds.
We generated a synthetic dataset using a tool provided by [Pro96] and described in [AS94]. The parameters chosen for the data set are an average transaction length of 15 and an average size of the frequent item-sets of 6. Data sets with 1000 transactions (6.3KB) were generated for different processors. Due to the disk limitations of the T3D system, we kept these small transaction sets in a buffer and read the transactions from the buffer instead of the actual disks. For the experiments involving larger data sets, we read the same data set multiple times.¹
We performed scaleup tests with 100K transactions per processor and a minimum support of 0.25%. We could not use a lower minimum support because the CD algorithm ran out of main memory. For this experiment, in the IDD and HD algorithms we set the minimum number of candidates for switching to the CD algorithm very low, to show the validity of our approaches. With 0.25% support, both algorithms switched to the CD algorithm in pass 7 of the total 12 passes, and 90.7% of the overall response time of the serial code was spent in the first 6 passes. These scaleup results are shown in Figure 9.

As noted in [AS96], the CD algorithm scales very well. Looking at the performance obtained by IDD, we see that its response time increases as we increase the number of processors. This is due to the load balancing problem discussed in Section 3, where the number of candidates per processor decreases as the number of processors increases. However, the performance achieved by IDD is much better than that of the DD algorithm of [AS96]. In particular, IDD has 4.4 times less response time than DD on 32 processors. It can be seen that the performance gap between IDD and DD widens as the number of processors increases.
This is due to the improvements we made in IDD over DD, namely the better data movement mechanism and the intelligent partitioning of the candidate set.
¹ We also performed similar experiments on an IBM SP2, in which the entire database resided on disks. Our experiments show that the I/O requirements do not change the relative performance of the various schemes.
[Figure 9: Scaleup results — response time (sec.) of the count (CD), data (DD), intelligent data (IDD), and hybrid (HD) distribution algorithms as the number of processors increases.]
Figure 10: Sizeup results with 16 processors and 0.25% minimum support — response time (sec.) of the count, intelligent data, and hybrid algorithms as the number of transactions per processor (in thousands) increases.
Figure 11: Response time on 16 processors with 50K transactions as the minimum support varies, for the count, intelligent data, hybrid, and simple hybrid algorithms. At each support level, the total number of candidate item-sets (shown in parentheses in the original figure) ranged from 211K to 2408K.

5 Conclusion

In this paper, we proposed two parallel algorithms for mining association rules. The IDD algorithm utilizes the total main memory available more effectively than the CD algorithm.
Figure 12: Response time on 64 processors with 50K transactions as the minimum support varies, for the count, intelligent data, hybrid, and simple hybrid algorithms. At each support level, the total number of candidate item-sets (shown in parentheses in the original figure) ranged from 211K to 5232K.
Number of processors     1     2     4     8     16    32    64
Successful down to       0.25  0.2   0.15  0.1   0.06  0.04  0.03
Ran out of memory at     0.2   0.15  0.1   0.06  0.04  0.03  0.02

Table 1: Minimum support (%) reachable with different numbers of processors in our algorithms.
Pass            2      3      4     5     6     7     8     9     10
Configuration   8x8    64x1   4x16  2x32  2x32  2x32  2x32  2x32  1x64
No. of Cand.    351K   4348K  115K  76K   56K   34K   16K   6K    2K

Table 2: Processor configuration and number of candidates of the HD algorithm with 64 processors and 0.04% minimum support for each pass. Note that the 64x1 configuration is the same as the DD algorithm and 1x64 is the same as the CD algorithm. The total number of passes was 13, and all passes after 9 had the 1x64 configuration.
This algorithm improves over the DD algorithm, which has high communication overhead and redundant work. The communication overhead was reduced by using a better data movement communication mechanism, and the redundant work was reduced by partitioning the candidate set intelligently and using bit maps to prune away unnecessary computation. However, as the number of available processors increases, the efficiency of this algorithm decreases unless the amount of work is increased by having a larger number of candidates.
The HD algorithm combines the advantages of CD and IDD. This algorithm partitions the candidate set just like IDD to exploit the aggregate main memory, but dynamically determines the number of partitions such that each partition of the candidate set fits into the main memory of a processor and each processor has a sufficient number of candidates for computation. It also exploits the advantage of CD by exchanging only count information and moving the minimum number of transactions among a smaller subset of processors.
The experimental results on a 128-processor Cray T3D parallel machine show that the HD algorithm scales just as well as the CD algorithm with respect to the number of transactions. It also exploits the aggregate main memory better and thus is able to find more association rules, with a much smaller minimum support, using a single scan of the database per pass. The IDD algorithm also outperforms the DD algorithm, but is not as scalable as HD and CD.
Future work includes applying these algorithms to real data, such as retail sales transactions, mail order history databases, and World Wide Web server logs [MJHS96], to confirm the experimental results in real-life domains. We plan to perform experiments on different platforms including the Cray T3E, IBM SP2, and SGI SMP clusters. We also plan to apply our ideas to generalized association rules [HF95, SA95] and sequential patterns [MTV95, SA96].
References
[AIS93] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of the 1993 ACM-SIGMOD Int. Conf. on Management of Data, Washington, D.C., 1993.

[AS94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of the 20th Int. Conf. on Very Large Data Bases (VLDB), Santiago, Chile, 1994.
[AS96] R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering, 8(6):962-969, 1996.

[HF95] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. In Proc. of the 21st Int. Conf. on Very Large Data Bases (VLDB), Zurich, Switzerland, 1995.

[HKK97]

[HS95] M. Houtsma and A. Swami. Set-oriented mining for association rules in relational databases. In Proc. of the 11th Int. Conf. on Data Engineering (ICDE), 1995.

[KGGK94] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, 1994.

[MJHS96] B. Mobasher, N. Jain, E.-H. Han, and J. Srivastava. Web mining: Pattern discovery from World Wide Web transactions. Technical report, Department of Computer Science, University of Minnesota, 1996.

[MTV95] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering frequent episodes in sequences. In Proc. of the First Int. Conf. on Knowledge Discovery and Data Mining (KDD), 1995.

[Pro96] IBM Quest Data Mining Project. Quest synthetic data generation code, 1996.

[PS82]

[SA95] R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of the 21st Int. Conf. on Very Large Data Bases (VLDB), Zurich, Switzerland, 1995.

[SA96] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proc. of the 5th Int. Conf. on Extending Database Technology (EDBT), Avignon, France, 1996.