[Figure 1: Apriori and DIC. The 1-itemsets, 2-itemsets, and 3-itemsets are counted over the transactions in 3 passes by Apriori and in 1.5 passes by DIC.]

[Figures 3-6 show the lattice of itemsets over the items A, B, C, D, from the empty itemset {} up to ABCD. Figure 3: Start of DIC algorithm. Figure 4: After M transactions. Figure 5: After 2M transactions. Figure 6: After one pass.]
6. If any dashed itemsets remain, go to step 2.

This way DIC starts counting just the 1-itemsets and then quickly adds counters for the 2-, 3-, 4-, ..., k-itemsets. After just a few passes over the data (usually less than two for small values of M) it finishes counting all the itemsets. Ideally, we would want M to be as small as possible so we can start counting itemsets very early in step 3. However, steps 3 and 4 incur considerable overhead so we do not reduce M below 100.
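To make the interval mechanism concrete, the following Python sketch walks through the control loop just described. It is our illustration, not the paper's implementation: the names (dic, min_count, the state constants) are ours, the support threshold is an absolute count, the data is held in memory rather than read from a file, and new candidates are added using the usual Apriori-style condition that all immediate subsets must already look large.

    # States follow the paper's dashed/solid circle/square terminology.
    DASHED_CIRCLE, DASHED_SQUARE, SOLID_CIRCLE, SOLID_SQUARE = range(4)

    def dic(transactions, min_count, M=100):
        """Simplified Dynamic Itemset Counting over an in-memory list of
        transactions (each a set of items); min_count is an absolute support
        threshold. Returns the itemsets found to be large."""
        n = len(transactions)
        items = sorted({i for t in transactions for i in t})
        # Every 1-itemset starts as a dashed circle at file position 0.
        state = {frozenset([i]): DASHED_CIRCLE for i in items}
        count = {s: 0 for s in state}
        start = {s: 0 for s in state}
        pos = 0  # total number of transactions read (wraps around the file)
        while any(st in (DASHED_CIRCLE, DASHED_SQUARE) for st in state.values()):
            # Read the next M transactions, incrementing every dashed itemset
            # that occurs in each transaction.
            for _ in range(M):
                t = set(transactions[pos % n])
                for s, st in state.items():
                    if st in (DASHED_CIRCLE, DASHED_SQUARE) and s <= t:
                        count[s] += 1
                pos += 1
                # Anything counted through the whole file turns solid.
                for s, st in state.items():
                    if st == DASHED_CIRCLE and pos - start[s] >= n:
                        state[s] = SOLID_CIRCLE
                    elif st == DASHED_SQUARE and pos - start[s] >= n:
                        state[s] = SOLID_SQUARE
            # Circles that reached the threshold become squares, and any unseen
            # superset all of whose immediate subsets are squares starts being
            # counted as a dashed circle from the current position.
            for s in list(state):
                if count[s] >= min_count and state[s] == DASHED_CIRCLE:
                    state[s] = DASHED_SQUARE
                elif count[s] >= min_count and state[s] == SOLID_CIRCLE:
                    state[s] = SOLID_SQUARE
                if state[s] in (DASHED_SQUARE, SOLID_SQUARE):
                    for i in items:
                        sup = s | {i}
                        if i in s or sup in state:
                            continue
                        if all(state.get(sup - {j}) in (DASHED_SQUARE, SOLID_SQUARE)
                               for j in sup):
                            state[sup] = DASHED_CIRCLE
                            count[sup], start[sup] = 0, pos
        return [s for s, st in state.items() if st == SOLID_SQUARE]

    # Example: dic([{"A","B"}, {"A","C"}, {"A","B","C"}], min_count=2, M=2)
    # returns frozensets for A, B, C, AB, and AC.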
2.1 The Data Structure

The implementation of the DIC algorithm requires a data structure which can keep track of many itemsets. In particular, it must support the following operations:

1. Add new itemsets.

2. Maintain a counter for every itemset. When transactions are read, increment the counters of those active itemsets which occur in the transaction. This must be very fast as it is the bottleneck of the whole process. We attempt to optimize this operation in Section 3.

3. Maintain itemset states by managing transitions from active to counted (dashed to solid) and from small to large (circle to square). Detect when these transitions should occur.

4. When itemsets do become large, determine what new itemsets should be added as dashed circles since they could now potentially be large.

The data structure used for this is exactly like the hash tree used in Apriori with a little extra information stored at each node. It is a trie with the following properties. Each itemset is sorted by its items (the sort order is discussed in Section 3). Every itemset we are counting or have counted has a node associated with it, as do all of its prefixes. The empty itemset is the root node. All the 1-itemsets are attached to the root node, and their branches are labeled by the
item they represent. All other itemsets are attached to their prefix containing all but their last item. They are labeled by that last item. Figure 7 shows a sample trie of this form. The dotted path represents the traversal which is made through the trie when the transaction ABC is encountered, so A, AB, ABC, AC, B, BC, and C must be incremented, and they are, in that order. The exact algorithm for this is described in Section 3.

[Figure 7: Hash Tree Data Structure. A trie over the items A, B, C, D, rooted at the empty itemset {}.]

Every node stores the last item in the itemset it represents, a counter, a marker as to where in the file we started counting it, its state, and its branches if it is an interior node.
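As a concrete illustration of these node contents, here is a minimal Python sketch of such a trie node (the class and field names are ours, not the paper's):

    from dataclasses import dataclass, field

    @dataclass
    class TrieNode:
        # One itemset in the counting trie; the root represents the empty itemset.
        last_item: str = ""                 # last item of the itemset ("" for the root)
        counter: int = 0                    # number of transactions counted so far
        start: int = 0                      # file position where counting began
        state: str = "dashed circle"        # dashed/solid circle/square, as in DIC
        branches: dict = field(default_factory=dict)  # children keyed by their last item

    # The empty itemset is the root; 1-itemsets hang off it, and every other
    # itemset hangs off the prefix containing all but its last item.
    root = TrieNode()
    for item in "ABCD":
        root.branches[item] = TrieNode(last_item=item)
    root.branches["A"].branches["B"] = TrieNode(last_item="B")  # node for the itemset AB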
2.2 Significance of DIC

There are a number of benefits to DIC. The main one is performance. If the data is fairly homogeneous throughout the file and the interval M is reasonably small, this algorithm generally makes on the order of two passes. This makes the algorithm considerably faster than Apriori, which must make as many passes as the maximum size of a candidate itemset. If the data is not fairly homogeneous, we can run through it in a random order (section 2.3).

Some important relevant work was done by Toivonen using sampling [Toi96]. His technique was to sample the data using a reduced threshold for safety, and then count the necessary itemsets over the whole data in just one pass. However, this pays the added penalty of having to count more itemsets due to the reduced threshold. This can be quite costly, particularly for datasets like the census data (see section 5). Instead of being conservative, our algorithm bravely marches on, on the assumption that it [4] will later come back to anything missed with little penalty.

Besides performance, DIC provides considerable flexibility by having the ability to add and delete counted itemsets on the fly. As a result, DIC can be extended to parallel and incremental update versions (see section 6.1.1).

2.3 Non-homogeneous Data

One weakness of DIC is that it is sensitive to how homogeneous the data is. In particular, if the data is very correlated, we may not realize that an itemset is actually large until we have counted it in most of the database. If this happens, then we will not shift our hypothetical boundary and start counting some of the itemset's supersets until we have almost finished counting the itemset. As it turns out, the census data we used is ordered by census district and exactly this problem occurs. To test the impact of this effect, we randomized the order of the transactions and re-ran DIC. It turned out to make a significant difference in performance (see Section 5). The cost associated with randomizing transaction order is small compared to the mining cost.

However, randomization may be impractical. For example, it may be expensive, the data may be stored on tape, or there might be insufficient space to store the randomized version. We considered several ways of addressing this problem:

- Virtually randomize the data. That is, visit the file in a random order while making sure that every pass is in the same order. This can incur a high seek cost, especially if the data is on tape. In this case, it may be sufficient to jump to a new location every few thousand transactions or so (see the sketch after this list).

- Slacken the support threshold. First, start with a support threshold considerably lower than the given one. Then, gradually increase the threshold to the desired level. This way, the algorithm begins fairly conservative and then becomes more confident as more data is collected. We experimented with this technique somewhat but with little success. However, perhaps more careful control of the slack or a different dataset would make this a useful technique.
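One rough way to realize the first option above is to fix a seeded permutation of transaction blocks and replay it identically on every pass. The sketch below is ours (the block size and seed are illustrative choices, not values from the paper):

    import random

    def virtually_randomized_order(num_transactions, block_size=1000, seed=0):
        # Yield transaction indices in a pseudo-random but repeatable block order.
        # Jumping block by block keeps seek costs down, and reusing the same seed
        # guarantees that every pass visits the file in exactly the same order.
        blocks = list(range(0, num_transactions, block_size))
        random.Random(seed).shuffle(blocks)   # the same permutation on every call
        for block_start in blocks:
            for i in range(block_start, min(block_start + block_size, num_transactions)):
                yield i

    # Example: one pass over a 5,000-transaction file in shuffled 1,000-record blocks.
    order = list(virtually_randomized_order(5000))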
One thing to note is that if the data is correlated with its location in the file, it may be useful to detect this and report it. This is possible if a "local" counter is kept along with each itemset which measures the count of the current interval. At the end of each interval it can be checked for considerable discrepancies with its overall support in the whole data set.

The DIC algorithm addresses the high-level strategy of what itemsets to count when. There are also lower level performance issues as to how to increment the appropriate counters for a particular transaction. We address these in section 3.

[4] We did not have time to implement and test Toivonen's algorithm as compared to ours. However, based on tests with lowered support thresholds, we suspect that DIC is quite competitive.
    itemset    cost
    {}         7
    A          6
    AB         5
    B          5
    BC         4
    C          4
    total (a)  31

Table 1: Increment Cost for ABCDEFG
(a) ABC, AC, AD, BCD, BD, CD, D, E, F, and G cost 0 since they are leaves.

    itemset    cost
    {}         7
    A          3
    AB         2
    B          2
    BC         1
    C          1
    total      16

Table 2: Increment Cost for EFGABCD

3 Item Reordering

The low-level problem of how to increment the appropriate counters for a given transaction is an interesting one in itself. Recall that the data structure we use is a trie structure much like that used in Apriori (see section 2.1). Given a collection of itemsets, the form of this structure is heavily dependent on the sort order of the items. Note that in our sample data structure (figure 7), the order of the items was A, B, C, D. Because of this, A occurs only once in the trie while D occurs five times.

To determine how to optimize the order of the items, it is important to understand how the counter incrementing process works. We are given a transaction, S (with items S[0] ... S[n]), in a certain order. To increment the appropriate counters we do the following, starting at the root node of the trie T:
    Increment(T, S) {
        /* Increment this node's counter */
        T.counter++;
        If T is not a leaf then forall i, 0 <= i <= n:
            /* Increment branches as necessary */
            If T.branches[S[i]] exists:
                Then Increment(T.branches[S[i]], S[i+1..n])
        Return. }
Therefore, the cost of running this subroutine is equal to:

    Σ_I ( n − Index(Last(I), S) )

where I ranges over non-leaf itemsets in T which occur in S, and n − Index(Last(I), S) is the number of items left in S after the last element of I. These items will be checked in the inner loop. Therefore, it is advantageous to have the items which occur in many itemsets be last in the sort order of the items (so few items will be left after them) and the items which occur in few itemsets be first.

For example, consider the structure in figure 7. Suppose there are also items E, F, and G, and we add their respective 1-itemsets to the data structure. There will now be three singletons hanging off the tree. If we insert ABCDEFG, the cost of the insert is 31 (see Table 1). However, if we change the order of the items to EFGABCD (note the tree structure remains the same since A, B, C, D did not change order), the cost becomes 16 (see Table 2). This is considerably cheaper. Therefore what we want is to order the items by the inverse of their popularity in the counted non-leaf itemsets. A reasonable approximation for this inverse is the inverse of their popularity in the first M transactions. Since during the first interval of transactions we are counting only 1-itemsets, there is not yet a tree structure which depends on the order. After the first M transactions, we change the order of the items and build the tree from there. Future transactions must be re-sorted according to the new ordering. This technique incurs some overhead due to the re-sorting, but for some data it can be beneficial overall.
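The following Python sketch (our code, not part of the paper) rebuilds the trie of figure 7 with the singletons E, F, and G added and counts the inner-loop checks made by the Increment routine above; it reproduces the per-node costs of Tables 1 and 2, totalling 31 for the order ABCDEFG and 16 for EFGABCD.

    def build_trie():
        # Itemsets of figure 7 plus the singletons E, F, G.
        itemsets = ["A", "B", "C", "D", "AB", "AC", "AD", "BC", "BD", "CD",
                    "ABC", "BCD", "E", "F", "G"]
        trie = {"": {}}                      # itemset -> {child label: child itemset}
        for s in itemsets:
            trie.setdefault(s, {})
            trie[s[:-1]][s[-1]] = s          # attach to the prefix lacking the last item
        return trie

    def increment_cost(trie, transaction):
        # Number of inner-loop checks made at each non-leaf node when the
        # Increment routine of Section 3 processes the given transaction.
        checks = {}
        def increment(node, items):
            if trie[node]:                   # interior node: scan the remaining items
                checks[node] = checks.get(node, 0) + len(items)
                for i, item in enumerate(items):
                    if item in trie[node]:
                        increment(trie[node][item], items[i + 1:])
        increment("", list(transaction))
        return checks

    trie = build_trie()
    for order in ("ABCDEFG", "EFGABCD"):
        costs = increment_cost(trie, order)
        print(order, costs, "total:", sum(costs.values()))
    # ABCDEFG gives {'': 7, 'A': 6, 'AB': 5, 'B': 5, 'BC': 4, 'C': 4}, total 31;
    # EFGABCD gives {'': 7, 'A': 3, 'AB': 2, 'B': 2, 'BC': 1, 'C': 1}, total 16.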
4 Implication rules

Some traditional measures of "interestingness" have been support combined with either confidence or interest. Consider these measures from a probabilistic model.

Let {A, B} be an itemset. Then the support is P({A, B}), which we write P(A, B). This is used to make sure that the items this rule applies to actually occur frequently enough for someone to care. It also makes the task computationally feasible by limiting the size of the result set, and is usually used in conjunction with other measures. The confidence of A ⇒ B is P(B | A), the conditional probability of B given A, which is equal to P(A, B)/P(A). It has the flaw that it ignores P(B). For example, P(A, B)/P(A) could equal P(B) (i.e. the occurrence of B is unrelated to A) and could still be high enough to make the rule hold. For example, if people buy milk 80% of the time in a supermarket and the purchase of milk is completely unrelated to the purchase of smoked salmon, then the confidence of salmon ⇒ milk is still 80%. This confidence is quite high, and therefore would generate a rule. This is a key weakness of confidence, and is particularly evident in census data, where many items are very likely to occur with or without other items.

The interest of A, B is defined as P(A, B)/(P(A)P(B)) and factors in both P(A) and P(B); essentially it is a measure of departure from independence. However, it only measures co-occurrence, not implication, in that it is completely symmetric.

To fill the gap, we define conviction as P(A)P(¬B)/P(A, ¬B). The intuition as to why this is useful is: logically, A → B can be rewritten as ¬(A ∧ ¬B), so we see how far A ∧ ¬B deviates from independence, and invert the ratio to take care of the outside negation [5]. We believe this concept is useful for a number of reasons:

- Unlike confidence, conviction factors in both P(A) and P(B) and always has a value of 1 when the relevant items are completely unrelated, like the salmon and milk example above.

- Unlike interest, rules which hold 100% of the time, like Vietnam veteran ⇒ more than five years old, have the highest possible conviction value of ∞. Confidence also has this property in that these rules have a confidence of 1. However, interest does not have this useful property. For example, if 5% of people are Vietnam veterans and 90% are more than five years old, we get interest = 0.05/(0.05 × 0.9) = 1.11, which is only slightly above 1 (the interest for completely independent items).

[5] In practice we do not invert the ratio; instead we search for low values of the uninverted ratio. This way we do not have to deal with infinities.
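To make these definitions concrete, the small Python sketch below (our function, with an assumed 10% salmon-buying rate that the paper does not give) computes support, confidence, interest, and conviction for the two examples above: the salmon/milk rule has confidence 0.8 but conviction 1, while the veteran/age rule has interest of only about 1.11 but infinite conviction.

    def measures(p_a, p_b, p_ab):
        # Support, confidence, interest, and conviction of the rule A => B,
        # given P(A), P(B), and P(A, B).
        support = p_ab
        confidence = p_ab / p_a
        interest = p_ab / (p_a * p_b)
        p_a_not_b = p_a - p_ab                       # P(A, not B)
        conviction = (p_a * (1 - p_b) / p_a_not_b    # P(A)P(not B) / P(A, not B)
                      if p_a_not_b > 0 else float("inf"))
        return support, confidence, interest, conviction

    # Salmon => milk: milk is bought 80% of the time, independently of salmon;
    # P(salmon) = 0.1 is an assumed value for illustration.
    print(measures(p_a=0.1, p_b=0.8, p_ab=0.1 * 0.8))   # confidence 0.8, conviction 1.0
    # Vietnam veteran => more than five years old: 5% veterans, 90% over five,
    # and the rule holds 100% of the time, so P(A, B) = P(A).
    print(measures(p_a=0.05, p_b=0.9, p_ab=0.05))       # interest ~1.11, conviction inf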
[Figure 10: Effect of Varying Interval Size on Performance]
[Figure 11: Performance With and Without Item Reordering]
used for supermarket analysis. Otherwise, far too many large itemsets would be generated. Even at the 36% support threshold, mining proved time consuming, taking nearly half an hour on just 30,000 records.

5.3.1 Performance on the Census Data

There are several reasons why mining the census data is so much more difficult than the synthetic data. The census data is 3.5 times wider than the synthetic data. So if we were counting all 2-itemsets, it would take 12 times longer per transaction (there are 12 times as many pairs in each row of the census data). If we were counting all 3-itemsets it would take 40 times longer; 4-itemsets would take 150 times longer. Of course we are not counting all the 2-, 3-, or 4-itemsets; however, we are counting many of them, and we are counting higher cardinality itemsets as well. Furthermore, even after taking out the items which have more than 80% support, we are still left with many popular items, such as "works 40 hours/week". These tend to combine to form many long itemsets.

The performance graphs show three curves (figure 9): one for Apriori, one for DIC, and one for DIC when we shuffled the order of the transactions beforehand (this has no effect on Apriori). For both tests with DIC, M was 10,000. The results clearly showed that DIC runs noticeably faster than Apriori and randomized DIC runs noticeably faster than DIC. For the support level of 0.36, randomized DIC ran 3.2 times faster than Apriori. By varying the value of M, we achieved slightly higher speedups: 3.3 times faster for support of 0.36 and 3.7 times faster for 0.38 (see next section).

5.4 Varying the Interval Size

One experiment we tried was to find the optimal value of M, the interval size (Figure 10). We tried values of 100, 300, 1000, and 10000 for M. The values in the middle of the range, 300 and 1000, worked the best, coming in second and first, respectively. An interval size of 100 proved the worst choice due to too much overhead incurred. A value of 10,000 was somewhat slow because it took more passes over the data than for lower interval values.

We also tried varying M for non-randomized data. These experiments failed miserably and we were not able to complete them. In terms of the number of passes, at a support threshold of 0.36, Apriori made 10 passes over the data; simple DIC made 9 with an M of 10,000; randomized DIC made 4, 2.1, 1.3, and 1.3 passes for values of M of 10000, 1000, 300, and 100 respectively. This shows that DIC, when combined with randomization and a sufficiently low M, does indeed finish in a very small number of passes.

5.5 Effect of Item Reordering

Item reordering was not nearly as successful as we had hoped. It made a small difference in some tests but overall played a negligible role in performance. In tests on census data it made less than 10% difference, sometimes in the wrong direction (Figure 11). This was something of a disappointment, but perhaps a better analysis of what the optimal order is and on-the-fly modification will yield better results.

5.6 Tests of Implication Rules

It is very difficult to quantify how well implication rules work. Due to the high support threshold, we considered rules based on the minimal small itemsets as well as the large itemsets. In total there were 23712 rules with conviction > 1.25, of which 6732 had a conviction of ∞. From these, we learned that five year olds don't work, unemployed residents don't earn income from work, men don't give birth, and many other interesting facts. Looking down the list to a conviction level of 50, we find that those who are not in the military, are not looking for work, and had work this year (1990, the year of the census), are currently employed as civilians. We list some sample rules in Table 3. Note that one problem was that many rules were very long (involving say seven items) and were too complicated to be interesting. Therefore, we list some of the shorter ones.

By comparison, tests with confidence produced some misleading results. For example, the confidence of "women who do not state whether they are looking for a job do not have personal care limitations" was 73%, which is at the high end of the scale. However, it turned out that this was simply because 76% of all respondents do not have personal care
    conviction  implication rule
    ∞           five year olds don't work
    ∞           unemployed people don't earn income from work
    ∞           men don't give birth
    50          people who are not in the military and are not looking for work and had work this year (1990, the year of the census) currently have civilian employment
    10          people who are not in the military and who worked last week are not limited in their work by a disability
    2.94        heads of household do not have personal care limitations
    1.5         people not in school and without personal care limitations have worked this year
    1.4         African-American women are not in the military
    1.28        African-Americans reside in the same state they were born
    1.28        unmarried people have moved in the past five years

Table 3: Sample Implication Rules From Census Data