
Dynamic Itemset Counting and Implication Rules
for Market Basket Data

Sergey Brin*    Rajeev Motwani†    Jeffrey D. Ullman‡
Department of Computer Science, Stanford University
{sergey,rajeev,ullman}@cs.stanford.edu

Shalom Tsur
R&D Division, Hitachi America Ltd.
[email protected]

* Work done at R&D Division of Hitachi America Ltd. Also supported by a fellowship from the NSF.
† Supported by an Alfred P. Sloan Research Fellowship, an IBM Faculty Partnership Award, an ARO MURI Grant DAAH04-96-1-0007, and NSF Young Investigator Award CCR-9357849, with matching funds from IBM, Mitsubishi, Schlumberger Foundation, Shell Foundation, and Xerox Corporation.
‡ Supported by a grant from IBM, a gift from Hitachi, and MITRE agreement number 21263.

Abstract

We consider the problem of analyzing market-basket data and present several important contributions. First, we present a new algorithm for finding large itemsets which uses fewer passes over the data than classic algorithms, and yet uses fewer candidate itemsets than methods based on sampling. We investigate the idea of item reordering, which can improve the low-level efficiency of the algorithm. Second, we present a new way of generating "implication rules," which are normalized based on both the antecedent and the consequent and are truly implications (not simply a measure of co-occurrence), and we show how they produce more intuitive results than other methods. Finally, we show how different characteristics of real data, as opposed to synthetic data, can dramatically affect the performance of the system and the form of the results.

1 Introduction

Within the area of data mining, the problem of deriving associations from data has recently received a great deal of attention. The problem was first formulated by Agrawal et al. [AIS93a, AIS93b, AS94, AS95, ALSS95, SA95, MAR96, Toi96] and is often referred to as the "market-basket" problem. In this problem, we are given a set of items and a large collection of transactions which are subsets (baskets) of these items. The task is to find relationships between the presence of various items within those baskets.

There are numerous applications of data mining which fit into this framework. The canonical example from which the problem gets its name is a supermarket. The items are products and the baskets are customer purchases at the checkout. Determining what products customers are likely to buy together can be very useful for planning and marketing. However, there are many other applications which have varied data characteristics: for example, student enrollment in classes, word occurrence in text documents, users' visits to web pages, and many more. We applied market-basket analysis to census data (see Section 5).

In this paper, we address both performance and functionality issues of market-basket analysis. We improve performance over past methods by introducing a new algorithm for finding large itemsets (an important subproblem). We enhance functionality by introducing implication rules as an alternative to association rules (see below).

One very common formalization of this problem is finding association rules which are based on support and confidence. The support of an itemset (a set of items) I is the fraction of transactions the itemset occurs in (is a subset of). An itemset is called large if its support exceeds a given threshold, σ. An association rule is written I → J, where I and J are itemsets¹. The confidence of this rule is the fraction of transactions containing I that also contain J. For the association rule I → J to hold, I ∪ J must be large and the confidence of the rule must exceed a given confidence threshold, γ. In probability terms, we can write this as P(I ∪ J) > σ and P(J | I) > γ.

¹ J is typically restricted to just one item, though it doesn't have to be.

The existing methods for deriving the rules consist of two steps:

1. Find the large itemsets for a given σ.

2. Construct rules which exceed the confidence threshold from the large itemsets in step 1. For example, if ABC is a large itemset we might check the confidence of AB → C, AC → B and BC → A.

In this paper we address both of these tasks, step 1 from a performance perspective by devising a new algorithm, and step 2 from a semantic perspective by developing conviction, an alternative to confidence.
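To make the definitions above concrete, here is a minimal sketch (ours, not part of the paper) of support and confidence computed over transactions represented as sets; the toy data and the thresholds are invented for illustration.

    # Support and confidence over a toy transaction list (illustrative only).
    transactions = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C", "D"}]

    def support(itemset, transactions):
        # Fraction of transactions that contain the itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(I, J, transactions):
        # Fraction of transactions containing I that also contain J.
        return support(I | J, transactions) / support(I, transactions)

    sigma, gamma = 0.4, 0.6                      # hypothetical thresholds
    I, J = frozenset("A"), frozenset("B")
    rule_holds = (support(I | J, transactions) > sigma and
                  confidence(I, J, transactions) > gamma)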
1.1 Algorithms for Finding Large Itemsets

Much research has focussed on deriving efficient algorithms for finding large itemsets (step 1). The most well-known algorithm is Apriori [AIS93b, AS94] which, as all algorithms for finding large itemsets, relies on the property that an itemset can be large only if all of its subsets are large. It proceeds level-wise. First it counts all the 1-itemsets² and finds the counts which exceed the threshold: the large 1-itemsets. Then it combines those to form candidate (potentially large) 2-itemsets, counts them and determines which are the large 2-itemsets. It continues by combining the large 2-itemsets to form candidate 3-itemsets, counting them and determining which are the large 3-itemsets, and so forth.

² A k-itemset is an itemset with k items.

Let Lk be the set of large k-itemsets. For example, L3 might contain {{A,B,C}, {A,B,D}, {A,D,F}, ...}. Let Ck be the set of candidate k-itemsets; this is always a superset of Lk. Here is the algorithm:

    Result := ∅;
    k := 1;
    C1 := set of all 1-itemsets;
    while Ck ≠ ∅ do
        create a counter for each itemset in Ck;
        forall transactions in the database do
            increment the counters of itemsets in Ck
                which occur in the transaction;
        Lk := all candidates in Ck
                which exceed the support threshold;
        Result := Result ∪ Lk;
        Ck+1 := all (k+1)-itemsets which have
                all of their k-item subsets in Lk;
        k := k + 1;
    end
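The pseudocode above translates directly into a runnable sketch. The Python version below is our illustration, not the authors' implementation: it uses a plain dictionary of counters rather than the hash tree discussed in Section 2.1, and takes the support threshold as a fraction.

    # Level-wise Apriori sketch following the pseudocode above (illustrative only).
    from itertools import combinations

    def apriori(transactions, min_support):
        transactions = [set(t) for t in transactions]
        n = len(transactions)
        result = {}                                   # large itemset -> support
        candidates = {frozenset([x]) for t in transactions for x in t}
        k = 1
        while candidates:
            counts = {c: 0 for c in candidates}
            for t in transactions:                    # one pass over the data per level
                for c in candidates:
                    if c <= t:
                        counts[c] += 1
            large = {c for c in candidates if counts[c] / n > min_support}
            result.update({c: counts[c] / n for c in large})
            # Candidate (k+1)-itemsets: unions of large k-itemsets all of whose
            # k-item subsets are large.
            candidates = {a | b for a in large for b in large
                          if len(a | b) == k + 1
                          and all(frozenset(s) in large for s in combinations(a | b, k))}
            k += 1
        return result

    # Example: apriori([{"A","B"}, {"B","C"}, {"A","B","C"}], 0.5)
    # returns supports for {A}, {B}, {C}, {A,B}, and {B,C}.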
Thus, the algorithm performs as many passes over the data as the maximum number of elements in a candidate itemset, checking at pass k the support for each of the candidates in Ck. The two important factors which govern performance are the number of passes made over all the data and the efficiency of those passes.

To address both of those issues we introduce Dynamic Itemset Counting (DIC), an algorithm which reduces the number of passes made over the data while keeping the number of itemsets which are counted in any pass relatively low as compared to methods based on sampling [Toi96]. The intuition behind DIC is that it works like a train running over the data with stops at intervals M transactions apart. (M is a parameter; in our experiments we tried values ranging from 100 to 10,000.) When the train reaches the end of the transaction file, it has made one pass over the data and it starts over at the beginning for the next pass. The "passengers" on the train are itemsets. When an itemset is on the train, we count its occurrence in the transactions that are read.

If we consider Apriori in this metaphor, all itemsets must get on at the start of a pass and get off at the end. The 1-itemsets take the first pass, the 2-itemsets take the second pass, and so on (see Figure 1). In DIC, we have the added flexibility of allowing itemsets to get on at any stop as long as they get off at the same stop the next time the train goes around. Therefore, the itemset has "seen" all the transactions in the file. This means that we can start counting an itemset as soon as we suspect it may be necessary to count it instead of waiting until the end of the previous pass.

[Figure 1: Apriori and DIC. Apriori counts the 1-itemsets, 2-itemsets, and 3-itemsets against the transactions in three separate passes; DIC overlaps the counting and finishes in 1.5 passes.]

For example, if we are mining 40,000 transactions and M = 10,000, we will count all the 1-itemsets in the first 40,000 transactions we read. However, we will begin counting 2-itemsets after the first 10,000 transactions have been read. We will begin counting 3-itemsets after 20,000 transactions. For now, we assume there are no 4-itemsets we need to count. Once we get to the end of the file, we will stop counting the 1-itemsets and go back to the start of the file to count the 2- and 3-itemsets. After the first 10,000 transactions, we will finish counting the 2-itemsets, and after 20,000 transactions, we will finish counting the 3-itemsets. In total, we have made 1.5 passes over the data³ instead of the 3 passes a level-wise algorithm would make.

³ Of course, in this example we assumed the best of all possible circumstances, where we estimated correctly exactly which 2- and 3-itemsets we would have to count. More realistically, some itemsets would be added a little later. Nonetheless there would still be considerable savings.

DIC addresses the high-level issues of when to count which itemsets and is a substantial speedup over Apriori, particularly when Apriori requires many passes. We deal with the low-level issue of how to increment the appropriate counters for each transaction in Section 3 by considering the sort order of items in our data structure.

1.2 Implication Rules

Our contribution to functionality in market-basket analysis is implication rules based on conviction, which we believe is a more useful and intuitive measure than confidence and interest (see discussion in Section 4). Unlike confidence, conviction is normalized based on both the antecedent and the consequent of the rule, like the statistical notion of correlation. Furthermore, unlike interest, it is directional and measures actual implication as opposed to co-occurrence. Because of these two features, implication rules can produce useful and intuitive results on a wide variety of data. For example, the rule past active duty in military ⇒ no service in Vietnam has a very high confidence of 0.9. Yet it is clearly misleading, since having past military service only increases the chances of having served in Vietnam. In tests on census data, the advantages of conviction over rules based on confidence or interest are evident.

In Section 5, we present the results of generating implication rules for U.S. census data from the 1990 census. Census data is considerably more difficult to mine than supermarket data, and the performance advantages of DIC for finding large itemsets are particularly useful.

2 Counting Large Itemsets

Itemsets form a large lattice with the empty itemset at the bottom and the set of all items at the top (see the example in Figure 2). Some itemsets are large (denoted by boxes), and the rest are small. Thus, in the example, the empty itemset, A, B, C, D, AB, AC, BC, BD, CD, and ABC are large.

[Figure 2: An itemsets lattice over the items A, B, C, D.]

To show that the itemsets are large we can count them. In fact, we must, since we generally want to know the counts. However, it is infeasible to count all of the small itemsets. Fortunately, it is sufficient to count just the minimal ones (the itemsets that do not include any other small itemsets), since if an itemset is small, all of its supersets are small too. The minimal small itemsets are denoted by circles; in our example, AD and BCD are minimal small. They form the top side of the boundary between the large and small itemsets (Toivonen calls this the negative boundary; in lattice theory the minimal small itemsets are called the prime implicants).

An algorithm which counts all the large itemsets must find and count all of the large itemsets and the minimal small itemsets (that is, all of the boxes and circles). The DIC algorithm, described here, marks itemsets in four different possible ways:
• Solid box - confirmed large itemset - an itemset we have finished counting that exceeds the support threshold.

• Solid circle - confirmed small itemset - an itemset we have finished counting that is below the support threshold.

• Dashed box - suspected large itemset - an itemset we are still counting that exceeds the support threshold.

• Dashed circle - suspected small itemset - an itemset we are still counting that is below the support threshold.

The DIC algorithm works as follows:

1. The empty itemset is marked with a solid box. All the 1-itemsets are marked with dashed circles. All other itemsets are unmarked. (See Figure 3.)

2. Read M transactions. We experimented with values of M ranging from 100 to 10,000. For each transaction, increment the respective counters for the itemsets marked with dashes. See Section 3.

3. If a dashed circle has a count that exceeds the support threshold, turn it into a dashed square. If any immediate superset of it has all of its subsets as solid or dashed squares, add a new counter for it and make it a dashed circle. (See Figures 4 and 5.)

4. If a dashed itemset has been counted through all the transactions, make it solid and stop counting it.

5. If we are at the end of the transaction file, rewind to the beginning. (See Figure 6.)

6. If any dashed itemsets remain, go to step 2.

[Figure 3: Start of DIC algorithm. Figure 4: After M transactions. Figure 5: After 2M transactions. Figure 6: After one pass. Each figure shows the itemset lattice of Figure 2 with the current markings.]

This way DIC starts counting just the 1-itemsets and then quickly adds counters for the 2-, 3-, 4-, ..., k-itemsets. After just a few passes over the data (usually less than two for small values of M) it finishes counting all the itemsets. Ideally, we would want M to be as small as possible so we can start counting itemsets very early in step 3. However, steps 3 and 4 incur considerable overhead, so we do not reduce M below 100.
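The loop above can be sketched in code. The version below is our own simplified rendering, not the authors' C++ implementation: it stores counters in a plain dictionary instead of the trie of Section 2.1, takes the threshold as an absolute count, and omits item reordering.

    # Simplified DIC sketch (illustrative only).  "Dashed" itemsets are still
    # being counted; "solid" ones have been counted through the whole file.
    # "Boxes" are itemsets at or above the threshold, "circles" are below it.
    def dic(transactions, min_count, M):
        transactions = [set(t) for t in transactions]
        N = len(transactions)
        items = sorted({x for t in transactions for x in t})
        empty = frozenset()
        boxes = {empty}                                   # suspected or confirmed large
        solid = {empty: N}                                # itemset -> final count
        dashed = {frozenset([x]): [0, 0] for x in items}  # itemset -> [count, seen]
        pos = 0
        while dashed:
            for _ in range(M):                            # step 2: read M transactions
                t = transactions[pos % N]
                pos += 1
                for s, rec in dashed.items():
                    if rec[1] < N:                        # not yet seen the whole file
                        rec[1] += 1
                        if s <= t:
                            rec[0] += 1
            for s in list(dashed):                        # step 3: circle -> box, add supersets
                if s not in boxes and dashed[s][0] >= min_count:
                    boxes.add(s)
                    for x in items:
                        sup = s | {x}
                        if x in s or sup in dashed or sup in solid:
                            continue
                        if all(sup - {y} in boxes for y in sup):
                            dashed[sup] = [0, 0]          # new dashed circle
            for s in list(dashed):                        # step 4: finished itemsets become solid
                if dashed[s][1] >= N:
                    solid[s] = dashed.pop(s)[0]
        return {s: c for s, c in solid.items() if c >= min_count}   # the large itemsets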
2.1 The Data Structure

The implementation of the DIC algorithm requires a data structure which can keep track of many itemsets. In particular, it must support the following operations:

1. Add new itemsets.

2. Maintain a counter for every itemset. When transactions are read, increment the counters of those active itemsets which occur in the transaction. This must be very fast, as it is the bottleneck of the whole process. We attempt to optimize this operation in Section 3.

3. Maintain itemset states by managing transitions from active to counted (dashed to solid) and from small to large (circle to square). Detect when these transitions should occur.

4. When itemsets do become large, determine what new itemsets should be added as dashed circles since they could now potentially be large.

The data structure used for this is exactly like the hash tree used in Apriori, with a little extra information stored at each node. It is a trie with the following properties. Each itemset is sorted by its items (the sort order is discussed in Section 3). Every itemset we are counting or have counted has a node associated with it, as do all of its prefixes. The empty itemset is the root node. All the 1-itemsets are attached to the root node, and their branches are labeled by the item they represent.
All other itemsets are attached to their prefix containing all but their last item. They are labeled by that last item. Figure 7 shows a sample trie of this form. The dotted path represents the traversal which is made through the trie when the transaction ABC is encountered, so A, AB, ABC, AC, B, BC, and C must be incremented, and they are, in that order. The exact algorithm for this is described in Section 3.

[Figure 7: Hash Tree Data Structure. A trie over the itemsets of Figure 2 with items ordered A, B, C, D; the dotted path shows the traversal for the transaction ABC.]

Every node stores the last item in the itemset it represents, a counter, a marker as to where in the file we started counting it, its state, and its branches if it is an interior node.
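A minimal sketch of such a node (ours; the field names are invented and the authors' C++ implementation surely differs) might look as follows. The increment method anticipates the traversal described in Section 3: it bumps this node's counter and then recurses into any branch labeled by a later item of the sorted transaction.

    # Sketch of a trie (hash tree) node for DIC (illustrative only).
    class Node:
        def __init__(self, last_item=None):
            self.last_item = last_item    # last item of the itemset this node represents
            self.counter = 0              # occurrences counted so far
            self.start = 0                # file position where counting of this itemset began
            self.state = "dashed circle"  # dashed/solid, circle/box
            self.branches = {}            # item -> child Node (empty for a leaf)

        def increment(self, suffix):
            # Count this node's itemset, then follow branches labeled by the
            # remaining items of the (sorted) transaction.
            self.counter += 1
            if self.branches:             # interior node
                for i, item in enumerate(suffix):
                    child = self.branches.get(item)
                    if child is not None:
                        child.increment(suffix[i + 1:])

    # For the trie of Figure 7, root.increment(["A", "B", "C"]) visits the nodes
    # for {}, A, AB, ABC, AC, B, BC, and C, in that order.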
2.2 Significance of DIC

There are a number of benefits to DIC. The main one is performance. If the data is fairly homogeneous throughout the file and the interval M is reasonably small, this algorithm generally makes on the order of two passes. This makes the algorithm considerably faster than Apriori, which must make as many passes as the maximum size of a candidate itemset. If the data is not fairly homogeneous, we can run through it in a random order (Section 2.3).

Some important relevant work was done by Toivonen using sampling [Toi96]. His technique was to sample the data using a reduced threshold for safety, and then count the necessary itemsets over the whole data in just one pass. However, this pays the added penalty of having to count more itemsets due to the reduced threshold. This can be quite costly, particularly for datasets like the census data (see Section 5). Instead of being conservative, our algorithm bravely marches on, on the assumption that it⁴ will later come back to anything missed with little penalty.

⁴ We did not have time to implement and test Toivonen's algorithm as compared to ours. However, based on tests with lowered support thresholds, we suspect that DIC is quite competitive.

Besides performance, DIC provides considerable flexibility by having the ability to add and delete counted itemsets on the fly. As a result, DIC can be extended to parallel and incremental update versions (see Section 6.1.1).

2.3 Non-homogeneous Data

One weakness of DIC is that it is sensitive to how homogeneous the data is. In particular, if the data is very correlated, we may not realize that an itemset is actually large until we have counted it in most of the database. If this happens, then we will not shift our hypothetical boundary and start counting some of the itemset's supersets until we have almost finished counting the itemset. As it turns out, the census data we used is ordered by census district, and exactly this problem occurs. To test the impact of this effect, we randomized the order of the transactions and re-ran DIC. It turned out to make a significant difference in performance (see Section 5). The cost associated with randomizing transaction order is small compared to the mining cost.

However, randomization may be impractical. For example, it may be expensive, the data may be stored on tape, or there might be insufficient space to store the randomized version. We considered several ways of addressing this problem:

• Virtually randomize the data. That is, visit the file in a random order while making sure that every pass is in the same order. This can incur a high seek cost, especially if the data is on tape. In this case, it may be sufficient to jump to a new location every few thousand transactions or so.

• Slacken the support threshold. First, start with a support threshold considerably lower than the given one. Then, gradually increase the threshold to the desired level. This way, the algorithm begins fairly conservative and then becomes more confident as more data is collected. We experimented with this technique somewhat but with little success. However, perhaps more careful control of the slack or a different dataset would make this a useful technique.

• One thing to note is that if the data is correlated with its location in the file, it may be useful to detect this and report it. This is possible if a "local" counter is kept along with each itemset which measures the count of the current interval. At the end of each interval it can be checked for considerable discrepancies with its overall support in the whole data set.

The DIC algorithm addresses the high-level strategy of what itemsets to count when. There are also lower-level performance issues as to how to increment the appropriate counters for a particular transaction. We address these in Section 3.
3 Item Reordering

The low-level problem of how to increment the appropriate counters for a given transaction is an interesting one in itself. Recall that the data structure we use is a trie structure much like that used in Apriori (see Section 2.1). Given a collection of itemsets, the form of this structure is heavily dependent on the sort order of the items. Note that in our sample data structure (Figure 7), the order of the items was A, B, C, D. Because of this, A occurs only once in the trie while D occurs five times.

To determine how to optimize the order of the items, it is important to understand how the counter incrementing process works. We are given a transaction, S (with items S[0] ... S[n]), in a certain order. To increment the appropriate counters we do the following, starting at the root node of the trie T:

    Increment(T, S) {
        /* Increment this node's counter */
        T.counter++;
        If T is not a leaf then forall i, 0 <= i <= n:
            /* Increment branches as necessary */
            If T.branches[S[i]] exists:
                Then Increment(T.branches[S[i]], S[i+1..n])
        Return.
    }

Therefore, the cost of running this subroutine is equal to

    Σ_I ( n - Index(Last(I), S) )

where I ranges over the non-leaf itemsets in T which occur in S, and n - Index(Last(I), S) is the number of items left in S after the last element of I. These items will be checked in the inner loop. Therefore, it is advantageous to have the items which occur in many itemsets be last in the sort order of the items (so few items will be left after them) and the items which occur in few itemsets be first.

For example, consider the structure in Figure 7. Suppose there are also items E, F, and G, and we add their respective 1-itemsets to the data structure. There will now be three singletons hanging off the tree. If we insert ABCDEFG, the cost of the insert is 31 (see Table 1). However, if we change the order of the items to EFGABCD (note the tree structure remains the same since A, B, C, D did not change order), the cost becomes 16 (see Table 2).

    itemset   cost
    ∅         7
    A         6
    AB        5
    B         5
    BC        4
    C         4
    totalᵃ    31

    Table 1: Increment Cost for ABCDEFG
    ᵃ ABC, AC, AD, BCD, BD, CD, D, E, F, and G cost 0 since they are leaves.

    itemset   cost
    ∅         7
    A         3
    AB        2
    B         2
    BC        1
    C         1
    total     16

    Table 2: Increment Cost for EFGABCD

This is considerably cheaper. Therefore, what we want is to order the items by the inverse of their popularity in the counted non-leaf itemsets. A reasonable approximation for this inverse is the inverse of their popularity in the first M transactions. Since during the first interval of transactions we are counting only 1-itemsets, there is not yet a tree structure which depends on the order. After the first M transactions, we change the order of the items and build the tree from there. Future transactions must be re-sorted according to the new ordering. This technique incurs some overhead due to the re-sorting, but for some data it can be beneficial overall.
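As a check on the cost formula and Tables 1 and 2 above, here is a small computation (ours, not from the paper) of the sum over the non-leaf itemsets of the Figure 7 trie; it reproduces the totals 31 and 16.

    # Non-leaf itemsets of the sample trie of Figure 7 (with E, F, G added);
    # "" stands for the empty itemset at the root.
    non_leaves = ["", "A", "AB", "B", "BC", "C"]

    def insert_cost(order):
        # Transaction ABCDEFG re-sorted according to `order`; items S[0..n].
        s = sorted("ABCDEFG", key=order.index)
        n = len(s) - 1
        def last_index(itemset):
            # The root scans all n+1 items, hence index -1 for the empty itemset.
            return -1 if itemset == "" else s.index(max(itemset, key=order.index))
        return sum(n - last_index(I) for I in non_leaves)

    print(insert_cost("ABCDEFG"))   # 31
    print(insert_cost("EFGABCD"))   # 16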
4 Implication Rules

Some traditional measures of "interestingness" have been support combined with either confidence or interest. Consider these measures from a probabilistic model.

Let {A, B} be an itemset. Then the support is P({A, B}), which we write P(A, B). This is used to make sure that the items this rule applies to actually occur frequently enough for someone to care. It also makes the task computationally feasible by limiting the size of the result set, and is usually used in conjunction with other measures. The confidence of A ⇒ B is P(B | A), the conditional probability of B given A, which is equal to P(A, B)/P(A). It has the flaw that it ignores P(B). For example, P(A, B)/P(A) could equal P(B) (i.e., the occurrence of B is unrelated to A) and could still be high enough to make the rule hold. For example, if people buy milk 80% of the time in a supermarket and the purchase of milk is completely unrelated to the purchase of smoked salmon, then the confidence of salmon ⇒ milk is still 80%. This confidence is quite high, and therefore would generate a rule. This is a key weakness of confidence, and is particularly evident in census data, where many items are very likely to occur with or without other items.

The interest of A, B is defined as P(A, B)/(P(A)P(B)) and factors in both P(A) and P(B); essentially it is a measure of departure from independence. However, it only measures co-occurrence, not implication, in that it is completely symmetric.

To fill the gap, we define conviction as P(A)P(¬B)/P(A, ¬B). The intuition as to why this is useful is: logically, A → B can be rewritten as ¬(A ∧ ¬B), so we see how far A ∧ ¬B deviates from independence, and invert the ratio to take care of the outside negation⁵. We believe this concept is useful for a number of reasons:

⁵ In practice we do not invert the ratio; instead we search for low values of the uninverted ratio. This way we do not have to deal with infinities.

• Unlike confidence, conviction factors in both P(A) and P(B) and always has a value of 1 when the relevant items are completely unrelated, like the salmon and milk example above.
• Unlike interest, rules which hold 100% of the time, like Vietnam veteran ⇒ more than five years old, have the highest possible conviction value of ∞. Confidence also has this property in that these rules have a confidence of 1. However, interest does not have this useful property. For example, if 5% of people are Vietnam veterans and 90% are more than five years old, we get interest = 0.05/((0.05)(0.9)) = 1.11, which is only slightly above 1 (the interest for completely independent items).

In short, conviction is truly a measure of implication because it is directional, it is maximal for perfect implications, and it properly takes into account both P(A) and P(B).
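To see the three measures side by side, here is a small computation (ours, not from the paper) using the numbers quoted above; the figure P(salmon) = 0.1 is an assumption, since the paper only states that milk is bought 80% of the time and is unrelated to salmon.

    # Confidence, interest, and conviction from joint probabilities (illustrative only).
    def confidence(p_a, p_ab):
        return p_ab / p_a

    def interest(p_a, p_b, p_ab):
        return p_ab / (p_a * p_b)

    def conviction(p_a, p_b, p_ab):
        p_a_not_b = p_a - p_ab                   # P(A, not B)
        return float("inf") if p_a_not_b == 0 else p_a * (1 - p_b) / p_a_not_b

    # salmon => milk: P(salmon) = 0.1 (assumed), P(milk) = 0.8, independent purchases.
    confidence(0.1, 0.08), interest(0.1, 0.8, 0.08), conviction(0.1, 0.8, 0.08)
    # -> roughly (0.8, 1.0, 1.0): high confidence, but "unrelated" by interest and conviction.

    # Vietnam veteran => more than five years old: P(A) = 0.05, P(B) = 0.9, P(A,B) = 0.05.
    confidence(0.05, 0.05), interest(0.05, 0.9, 0.05), conviction(0.05, 0.9, 0.05)
    # -> (1.0, about 1.11, inf): the perfect rule gets infinite conviction; interest barely moves.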
5 Results

We tested the DIC algorithm along with reordering on two different types of data: synthetic data generated by the IBM test data generator⁶ and U.S. census data. We also tested implication rules on both sets of data, but the results of these tests are only interesting on the census data. This is because the synthetic data is designed to test association rules based on support and confidence. Also, the rules generated are only interesting and can be evaluated for utility if the items involved have some actual meaning. Overall, the results of tests on both sets of data justified DIC, and tests on the census data justified implication rules.

⁶ http://www.almaden.ibm.com/cs/quest/syndata.html

5.1 Test Data

The synthetic test data generator is well documented in [AS94] and it was very convenient to use. We used 100,000 transactions, with an average size of 20 items chosen from 1000 items, and average large itemsets were of size 4.

The census data was a bit more challenging. We chose to look at PUMS files, which are Public Use Microdata Samples. They contain actual census entries which constitute a five percent sample of the state the file represents. In our tests we used the PUMS file for Washington D.C., which consists of roughly 30,000 entries. Each entry has 127 attributes, each of which is represented as a decimal number. For example, the SEX field uses one digit: 0 for male and 1 for female. The INCOME field uses six digits: the actual dollar value of that person's income for the year. We selected 73 of these attributes to study and took all possible (field, value) pairs. For the numerical attributes, like INCOME, we took the logarithm of the value and rounded it to the nearest integer to reduce the number of possible answers⁷. In total this yielded 2166 different items.

⁷ There has been much work done on bucketizing numerical parameters. However, this is not the focus of our research, so we took a simple approach.
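The (field, value) items can be pictured with a small sketch; this is our reading of the description above, the field names SEX and INCOME come from the paper, and the choice of natural logarithm and the handling of zero values are assumptions.

    import math

    def make_items(record, numeric_fields):
        # record: dict of field -> decimal value for one census entry (hypothetical format).
        items = set()
        for field, value in record.items():
            if field in numeric_fields:
                # Log-scale bucket, rounded to the nearest integer (log base assumed).
                value = round(math.log(value)) if value > 0 else 0
            items.add((field, value))
        return items

    make_items({"SEX": 1, "INCOME": 32000}, {"INCOME"})
    # -> {("SEX", 1), ("INCOME", 10)}, since log(32000) is about 10.4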
This data has several important differences from the synthetic data. First, it is considerably wider: 73 versus 20 items per transaction. Second, many of the items are extremely popular, such as worked in 1989. Third, the entire itemset structure is far more complex than that of synthetic data because correlations can be directional (for example, given birth to 2 children ⇒ Female, but the reverse is not true), many rules are true 100% of the time (same example), and almost all attributes are correlated to some degree. These factors make mining census data considerably more difficult than supermarket-style data.

5.2 Test Implementations

We implemented DIC and Apriori in C++ on several different Unix platforms. The implementations were mostly the same, since Apriori can be thought of as a special case of DIC: the case where the interval size is the size of the data file, not M. Note that much work has been done on Apriori in recent years to optimize it, and it was impossible for us to perform all of those optimizations. However, almost all of these optimizations are equally applicable to DIC.

5.3 Relative Performance of DIC and Apriori

Both DIC and Apriori were run on the synthetic data and census data. Running both algorithms on the synthetic data was fairly straightforward. We tried a range of support values and produced large itemsets relatively easily (Figure 8). Apriori beat out DIC by about 30% on the high-support end of the graph, but DIC outperformed Apriori in most tests, running 30% faster at a support threshold of 0.005. However, running both DIC and Apriori on census data was tricky. This is because a number of items in the census data appeared over 95% of the time and therefore there were a huge number of large itemsets. To address this problem, items with over 80% support were dropped. There were still a fair number of large itemsets, but manageable at very high support thresholds. The reader will notice that the tests were run on support levels between 36% and 50%, which are more than an order of magnitude higher than support levels used for supermarket analysis. Otherwise, far too many large itemsets would be generated. Even at the 36% support threshold, mining proved time consuming, taking nearly half an hour on just 30,000 records.

[Figure 8: Performance of Apriori and DIC on Synthetic Data. Execution time (sec) versus support threshold, from 0.02 down to 0.006.]
5.3.1 Performance on the Census Data

There are several reasons why mining the census data is so much more difficult than the synthetic data. The census data is 3.5 times wider than the synthetic data. So if we were counting all 2-itemsets, it would take 12 times longer per transaction (there are 12 times as many pairs in each row of the census data). If we were counting all 3-itemsets it would take 40 times longer; 4-itemsets would take 150 times longer. Of course we are not counting all the 2-, 3-, or 4-itemsets; however, we are counting many of them, and we are counting higher-cardinality itemsets as well. Furthermore, even after taking out the items which have more than 80% support, we are still left with many popular items, such as works 40 hours/week. These tend to combine to form many long itemsets.

The performance graphs show three curves (Figure 9): one for Apriori, one for DIC, and one for DIC when we shuffled the order of the transactions beforehand (this has no effect on Apriori). For both tests with DIC, M was 10,000. The results clearly showed that DIC runs noticeably faster than Apriori, and randomized DIC runs noticeably faster than DIC. For the support level of 0.36, randomized DIC ran 3.2 times faster than Apriori. By varying the value of M, we achieved slightly higher speedups: 3.3 times faster for support of 0.36 and 3.7 times faster for 0.38 (see the next section).

[Figure 9: Performance of Apriori and DIC on Census Data. Execution time (sec) versus support threshold, from 0.5 down to 0.36, for Apriori, DIC, and Randomized DIC.]

5.4 Varying the Interval Size

One experiment we tried was to find the optimal value of M, the interval size (Figure 10). We tried values of 100, 300, 1000, and 10,000 for M. The values in the middle of the range, 300 and 1000, worked the best, coming in second and first respectively. An interval size of 100 proved the worst choice due to too much overhead incurred. A value of 10,000 was somewhat slow because it took more passes over the data than for lower interval values.

[Figure 10: Effect of Varying Interval Size on Performance. Execution time (sec) versus support threshold for M = 10,000; 1,000; 300; and 100.]

We also tried varying M for non-randomized data. These experiments failed miserably and we were not able to complete them. In terms of the number of passes, at a support threshold of 0.36, Apriori made 10 passes over the data; simple DIC made 9 with an M of 10,000; randomized DIC made 4, 2.1, 1.3, and 1.3 passes for values of M of 10,000, 1000, 300, and 100 respectively. This shows that DIC, when combined with randomization and a sufficiently low M, does indeed finish in a very small number of passes.

5.5 Effect of Item Reordering

Item reordering was not nearly as successful as we had hoped. It made a small difference in some tests but overall played a negligible role in performance. In tests on census data it made less than 10% difference, sometimes in the wrong direction (Figure 11). This was something of a disappointment, but perhaps a better analysis of what the optimal order is and on-the-fly modification will yield better results.

[Figure 11: Performance With and Without Item Reordering. Execution time (sec) versus support threshold, with and without reordering.]

5.6 Tests of Implication Rules

It is very difficult to quantify how well implication rules work. Due to the high support threshold, we considered rules based on the minimal small itemsets as well as the large itemsets. In total there were 23,712 rules with conviction > 1.25, of which 6732 had a conviction of ∞. From these, we learned that five year olds don't work, unemployed residents don't earn income from work, men don't give birth, and many other interesting facts. Looking down the list to a conviction level of 50, we find that those who are not in the military, are not looking for work, and had work this year (1990, the year of the census) are currently employed as civilians. We list some sample rules in Table 3. Note that one problem was that many rules were very long (involving say seven items) and were too complicated to be interesting. Therefore, we list some of the shorter ones.
    conviction   implication rule
    ∞            five year olds don't work
    ∞            unemployed people don't earn income from work
    ∞            men don't give birth
    50           people who are not in the military and are not looking for work
                 and had work this year (1990, the year of the census) currently
                 have civilian employment
    10           people who are not in the military and who worked last week
                 are not limited in their work by a disability
    2.94         heads of household do not have personal care limitations
    1.5          people not in school and without personal care limitations have
                 worked this year
    1.4          African-American women are not in the military
    1.28         African-Americans reside in the same state they were born
    1.28         unmarried people have moved in the past five years

    Table 3: Sample Implication Rules From Census Data

By comparison, tests with confidence produced some misleading results. For example, the confidence of women who do not state whether they are looking for a job do not have personal care limitations was 73%, which is at the high end of the scale. However, it turned out that this was simply because 76% of all respondents do not have personal care limitations.

Interest also produced less useful results. For example, the interest of male and never given birth is 1.83, which is considerably lower than very many itemsets which we would consider less related, and it appears 40% of the way down the list of rules with interest greater than 1.25.

6 Conclusions

6.1 Finding Large Itemsets

We found that the DIC algorithm, particularly when combined with randomization, provided a significant performance boost for finding large itemsets. Item reordering did not work as well as we had hoped. However, in some isolated earlier tests it seemed to make a big difference. We suspect that a different method for determining the item ordering might make this technique useful. Selecting the interval M made a big difference in performance and warrants more investigation. In particular, we may consider a varying interval depending on how many itemsets were added at the last checkpoint.

There are a number of possible extensions to DIC. Because of its dynamic nature, it is very flexible and can be adapted to parallel and incremental mining.

6.1.1 Parallelism

The most efficient known way to parallelize finding large itemsets has been to divide the database among the nodes and to have each node count all the itemsets for its own data segment. Two key performance issues are load balancing and synchronization. Using Apriori, it is necessary to wait after each pass to get the results from all nodes to determine what the new candidate sets are for the next pass. Since DIC can dynamically incorporate new itemsets to be added, it is not necessary to wait. Nodes can proceed to count the itemsets they suspect are candidates and make adjustments as they get more results from other nodes.

6.1.2 Incremental Updates

Handling incremental updates involves two things: detecting when a large itemset becomes small and detecting when a small itemset becomes large. It is the latter that is more difficult. If a small itemset becomes large, we may now have new potentially large itemsets (new prime implicants) which we must count over the entire data, not just the update. Therefore, when we determine that a new itemset must be counted, we must go back and count it over the prefix of the data that we missed. This is very much like the way DIC goes back to count the prefixes of itemsets it missed, and it is straightforward to extend DIC in this way.

Another consideration is whether it is more useful to find the large itemsets over all the data or to mine just the recent data (perhaps on the order of several months, which may correspond to many small updates). Recall the train analogy (Section 1.1). The solution is to have two trains: one reading current data as it is coming in and incrementing the appropriate counts, and the other reading several-months-old data and decrementing the appropriate counts to remove its effects. In order to do this, it is necessary to be able to add and remove itemset counters on the fly, quickly and efficiently, as DIC already does for static data. This extension may be particularly useful.

6.1.3 Census Data

The census data was particularly challenging. Market-basket analysis techniques had not been designed to deal well with this kind of data. It was difficult for several reasons (see Section 5.1): the data was very wide (more than 70 items per transaction), items were very varied in support (from very close to 0% to very close to 100%), and there was a lot to mine (many things were highly correlated). It was much more difficult to mine than supermarket data, which is much more uniform in many ways. We believe that many other data sets are similarly challenging to mine and more work should be done toward handling them efficiently. It may be useful to develop some overall measures of difficulty for market-basket data sets.

6.2 Implication Rules

Looking over the implication rules generated on census data was educational. First, it was educational because most of the rules themselves were not. The rules that came out at the top were things that were obvious. Perhaps the interesting thing about the rules with a very high conviction value is why those that are very high are not ∞. For example, who are the seven people who earned over $160,000 last year but are less than 500% over the poverty line?
The most interesting rules were found in the middle of the range: not so high as to be obvious (anything over 5) but not so low as to be insignificant (around 1.01). We believe that this is generally true of any data mining application. The extremely correlated effects are generally well known and obvious. The truly interesting ones are far less correlated.

A big problem was the number of rules that were generated: over 20,000. It is both impossible and unnecessary to deal with so many rules. We have considered several techniques for pruning them:

• First, one can prune rules which are not minimal. For example, if we have A, B ⇒ C but also A ⇒ C, then we may prune the first rule. This is somewhat nontrivial in that the longer rule may hold with a higher conviction value, and therefore we may want to keep it. We have implemented pruning of all rules which have subsets with at least as high a conviction value (a sketch of this pruning appears after this list). This has proven quite effective, and in tests it cuts down the number of rules generated by more than a factor of 5. The rules which are pruned are typically long and can be misleading. An example of a pruned rule is an employed civilian who had work in 1989, is not looking for a job, is not on a leave of absence, is caucasian, and whose primary language is english, has worked this year. It is implied by the rule an employed civilian has worked this year. This kind of pruning is very effective and can produce concise output for the user.

• Second, one can prune transitively implied rules. For rules that hold 100% of the time, if we have A ⇒ B and B ⇒ C, one may want to prune A ⇒ C. However, there are several difficulties. First, which minimal set of rules should one pick? There can be many. And second, how should the rules which don't hold 100% of the time be handled? That is, how should the conviction value be expected to carry through?
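The first kind of pruning can be sketched as follows; the rule representation here is our own, not the authors'.

    # Prune any rule that has a subset rule (same consequent, antecedent a proper
    # subset) with at least as high a conviction (illustrative sketch only).
    from itertools import combinations

    def prune_rules(rules):
        # rules: dict mapping (frozenset antecedent, consequent item) -> conviction.
        kept = {}
        for (antecedent, consequent), conv in rules.items():
            dominated = any(
                rules.get((frozenset(sub), consequent), float("-inf")) >= conv
                for k in range(len(antecedent))
                for sub in combinations(antecedent, k))
            if not dominated:
                kept[(antecedent, consequent)] = conv
        return kept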
Overall, conviction has proven to be a useful new measure, having the benefits of being ∞ for perfect rules and 1 for completely uncorrelated rules. Moreover, it generally ranks rules in a reasonable and intuitive way. Unlike confidence, it does not assign high values to rules simply because the consequent is popular. Unlike interest, it is directional and is strongly affected by the direction of the arrow, so that it can truly generate implication rules. However, as we found out, a good ranking system for rules is not enough by itself. More work needs to be done on rule pruning and filtering.

From our experiments we have learned that not all data which fits into the market-basket framework behaves nearly as well as supermarket data. We do not believe that this is because it is a wrong choice of framework. It is simply because doing this kind of analysis on data like census data is difficult. There are very many correlations and redundancies in census data. If we are aware beforehand of all of its idiosyncrasies, we can probably simplify the problem considerably (for example, by collapsing all the redundant attributes) and find a specialized solution for it. However, we want to build a general system, capable of detecting and reporting the interesting aspects of any data we may throw at it. Toward this goal, we developed DIC to make the task less painful to those of us who are impatient, and we developed conviction so that the results of mining market-basket data are more usable.

Acknowledgements

We would like to thank members of the Stanford Data Mining Group for helpful discussions. Also, we wish to thank Hitachi America's new Information Technology Group for supporting this work. Finally, we are grateful to Mema Roussopoulos for much help.

References

[AIS93a] R. Agrawal, T. Imielinski, and A. Swami. Database Mining: A Performance Perspective. IEEE Transactions on Knowledge and Data Engineering, 5(6):914-925, December 1993.

[AIS93b] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proc. of the ACM SIGMOD Int'l Conf. on Management of Data, pages 207-216, May 1993.

[ALSS95] R. Agrawal, K. Lin, S. Sawhney, and K. Shim. Fast Similarity Search in the Presence of Noise, Scaling and Translation in Time-Series Databases. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), 1995.

[AS94] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the 20th VLDB Conference, Santiago, Chile, 1994.

[AS95] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of the 11th International Conference on Data Engineering, Taipei, Taiwan, 1995.

[MAR96] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A Fast Scalable Classifier for Data Mining. March 1996.

[SA95] R. Srikant and R. Agrawal. Mining Generalized Association Rules. 1995.

[Toi96] H. Toivonen. Sampling Large Databases for Association Rules. In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB), 1996.
