Market Basket Analysis in a Multiple Store Environment

Yen-Liang Chen, Kwei Tang, Ren-Jie Shen, Ya-Han Hu
Abstract
Market basket analysis (also known as association-rule mining) is a useful method of discovering customer purchasing
patterns by extracting associations or co-occurrences from stores’ transactional databases. Because the information obtained
from the analysis can be used in forming marketing, sales, service, and operation strategies, it has drawn increased research
interest. The existing methods, however, may fail to discover important purchasing patterns in a multi-store environment,
because of an implicit assumption that products under consideration are on shelf all the time across all stores. In this paper, we
propose a new method to overcome this weakness. Our empirical evaluation shows that the proposed method is computationally
efficient, and that it has an advantage over the traditional method when stores are diverse in size, product mix changes rapidly over
time, and larger numbers of stores and periods are considered.
© 2004 Elsevier B.V. All rights reserved.
doi:10.1016/j.dss.2004.04.009
340 Y.-L. Chen et al. / Decision Support Systems 40 (2005) 339–354
and confidence is a measure of the accuracy of the rule, defined as the ratio of the number of transactional records with both X and Y to the number of transactional records with X only. By far, the Apriori algorithm [1] is the best-known algorithm for mining the association rules from a transactional database that satisfy the minimum support and confidence levels specified by users.

Since association rules are useful and easy to understand, there have been many successful business applications, including, for example, finance, telecommunication, marketing, retailing, and web analysis [5]. The method has also attracted increased research interest, and many extensions have been proposed in recent years, including (1) algorithm improvements [6,12,18,21]; (2) fuzzy rules [13,14]; (3) multi-level and generalized rules [7,10]; (4) quantitative rules [20,24,25]; (5) spatial rules [7,15]; (6) inter-transaction rules [19]; (7) interesting rules [4,9]; and (8) temporal association rules [3,16,17]. Brief literature reviews of association rules are given by Chen et al. [8] and Han and Kamber [11].

In today's business world, it is common for a company to have subsidiaries, branches, dealers, or franchises in different geographical locations. For example, Wal-Mart, the largest supermarket chain in the world, has more than 4400 stores worldwide. For a company with multiple stores, discovery of purchasing patterns that may vary over time and exist in all, or in subsets of, stores can be useful in forming marketing, sales, service, and operation strategies at the company, local, and store levels.

There are two main problems in using the existing methods in a multi-store environment. The first is caused by the temporal nature of purchasing patterns. An apparent example is seasonal products. Temporal rules [3,16,17] were developed to overcome the weakness of the static association rules, which either find patterns at a point of time or implicitly assume that the patterns stay the same over time and across stores. A literature review on temporal rules is given by Roddick and Spiliopoulou [22]. In temporal rules, selling periods are considered in computing the support value, where the selling period of a product is defined as the time between its first and last appearances in the transaction records. Furthermore, the common selling period of the products in a product set is used as the base in computing the "temporal support" of the product set. The results of the method may be biased, however, because a product may be on shelf before its first transaction and/or after its last transaction occurs, and a product may also be put on-shelf and taken off-shelf multiple times during the data collection period.

The second problem is associated with finding common association patterns in subsets of stores. Similar to the problem in using existing temporal rules in a multi-store environment, we have to consider the possibility that some products may not be sold in some stores, for example, because of geographical, environmental, or political reasons. This is seemingly related to spatial association rules. However, the focus of spatial rules is on finding association patterns that are related to topological or distance information in, for example, maps, remote sensing or medical imaging data, and VLSI chip layout [23].

To overcome these problems, we develop an Apriori-like algorithm for automatically extracting association rules in a multi-store environment. The format of the rules is similar to that of the traditional rules. However, the rules also contain information on the store (location) and time where the rules hold. The results of the proposed method may contain rules that are applicable to the entire chain without time restriction, or to a subset of stores in specific time intervals. For example, a rule may state: "In the second week of August, customers purchase computers, printers, Internet and wireless phone services jointly in electronics stores near campus." Another example is: "In January, customers purchase cold medicine, humidifiers, coffee, and sunglasses together in supermarkets near skiing resorts." These rules can be used not only for general or localized marketing strategies, but also for product procurement, inventory, and distribution strategies for the entire store chain. Furthermore, we allow an item to have multiple selling time periods; i.e., an item may be put on-shelf and taken off-shelf multiple times. We further assume that different stores can have different product mixes in different time periods. That is, each store can have its own product mix, and the product mix in a store can change dynamically over time.

Because the time and store (location) factors are considered, the rule generation procedure is more complicated than the Apriori algorithm. The simulation results presented in the paper show that the proposed method is computationally efficient and has a significant advantage over the traditional association method when the stores under consideration are diverse in size and have product mixes that change rapidly over time.

The paper is organized as follows. We formally define the problem in Section 2 and in Section 3 propose an algorithm. In Section 4, we compare the results generated from the proposed algorithm and the traditional Apriori algorithm in a simulated multi-store environment. The conclusion is given in Section 5.

2. Problem definition

We consider a market basket database D that contains transactional records from multiple stores over a time period T. Our objective is to extract association rules from the database. For convenience in presentation, the cardinality of a set, say R, is denoted by |R|. Let I = {I1, I2, ..., Ir} be the set of product items included in D, where Ik (1 ≤ k ≤ r) is the identifier for the kth item. Let X be a set of items in I. We refer to X as a k-itemset if |X| = k. Furthermore, a transaction, denoted by s, is a subset of I. We use W(X, D) = {s | s ∈ D ∧ X ⊆ s} to denote the set of transactions in D which contain itemset X.

Definition 1. The support of X, denoted by sup(X, D), is the fraction of transactions containing X in database D; i.e., sup(X, D) = |W(X, D)|/|D|. For a specified support threshold σs, X is a frequent itemset if sup(X, D) ≥ σs.

Note that the definitions of the support and the frequent itemset are those used in the traditional association rules, and, therefore, the store and time information is not considered in determining the support of an itemset.

Let {T1, T2, ..., Tm} be a set of mutually disjoint time intervals (periods) that form a complete partition of T. Furthermore, they are ordered, such that Ti+1 immediately follows Ti for i ≥ 1. Note that the time periods are defined according to the specific needs of the problem, such as 1 h, 6 h, 1 day, 1 week, and so on. Let P = {P1, P2, ..., Pq} be the set of stores, where Pj (1 ≤ j ≤ q) denotes the jth store in the store chain. We assume that each transaction s in D is attached with a timestamp, t, and a store identifier, p, to indicate the store and time that the transaction occurs.

Let Sk ⊆ P and Rk ⊆ T be the sets of the stores and times that item Ik is sold, respectively. We define VIk = Sk × Rk as the context of item Ik; i.e., the set of the combinations of stores and times where item Ik is sold. Furthermore, the context of itemset X, denoted by VX, is the set of the combinations of stores and times that all items in X are sold concurrently. For example, if itemset X consists of two items Ik and Ik', the context of X is given by VX = VIk ∩ VIk'.

Definition 2. Let X be an itemset in I with context VX, and DVX the subset of transactions in D whose timestamps t and store identifiers p satisfy VX. We define the relative support of X with respect to the context VX, denoted by rel_sup(X, DVX), as |W(X, DVX)|/|DVX|. For a given relative support threshold σr, if a frequent itemset X satisfies rel_sup(X, DVX) ≥ σr, we call X a relative-frequent (RF) itemset.

In the last definition, we require that a relative-frequent itemset X be frequent. We add this restriction for two reasons. First, it enables us to preserve the well-known downward-closure property, by which the candidate set of the next phase can be obtained by joining the frequent sets of the preceding phase; this greatly improves the performance of the algorithm. Second, this restriction does not present any real problem to the mining algorithm, because none of the important patterns would be missing if a low σs value is used. Therefore, we prefer using a low σs value. However, it should not be too low, because an itemset that occurs in only a few transactions has no practical significance.

Furthermore, the minimum threshold for the relative support of an itemset is used to determine whether a sufficient percentage of transactions exists in its context to warrant the inclusion of the itemset as a relative-frequent (RF) itemset. Its use and purpose are similar to those of the traditional minimum support threshold. Consequently, we can set its value the same way as we set the traditional minimum support threshold.

Definition 3. Consider two itemsets X and Y. The relative support of X with respect to the context VX∪Y, denoted by rel_sup(X, DVX∪Y), is defined as |W(X, DVX∪Y)|/|DVX∪Y|. The confidence of rule X ⇒ Y,
denoted by conf(X ⇒ Y), is defined as rel_sup(X∪Y, DVX∪Y)/rel_sup(X, DVX∪Y).

The above definition implies that the context of rule X ⇒ Y is VX∪Y; i.e., the base used to compute the confidence of rule X ⇒ Y is the common stores and time periods shared by all the items in X∪Y.

Definition 4. Let Z be an RF itemset, where Z = X∪Y, X ⊆ I, and Y ⊆ I \ X. Given a confidence threshold σc, if conf(X ⇒ Y) ≥ σc, we call X ⇒ Y a store-chain (SC) association rule, and VX∪Y the context of the rule.

Based on Definitions 1 and 4, it is clear that the selection criteria and outputs for the store-chain association rules are different from those of the traditional association rules. For the store-chain rules, the output includes the confidence, the support, and a context indicating the stores and times the rules hold.

It can be shown that the traditional method underestimates the support and the confidence values (a proof is given in Appendix A). Consequently, important purchasing patterns that satisfy the criteria of the SC association rules may not be identified by the traditional association-rule methods.

3. Algorithm

We propose an Apriori-like algorithm for mining the store-chain association rules. The algorithm is outlined in Fig. 1. We first explain the general concept for developing the algorithm and then use five subsections to give detailed information on several key steps of the algorithm.

In describing the algorithm, we use RFk to denote the set of all relative-frequent k-itemsets; Fk, the set of all frequent k-itemsets; and Ck, the set of candidate k-itemsets. Note that, in the traditional Apriori algorithm, a candidate k-itemset must be a combination of frequent (k−1)-itemsets because of the anti-monotone property [1]. Therefore, the Apriori algorithm can generate the candidate itemsets in the kth phase by joining the frequent itemsets of the (k−1)th phase. However, for the SC association rule, a subset of an RF itemset may not be an RF itemset, because the base for calculating the relative support value varies in different phases. Consequently, in the proposed algorithm, we generate candidate itemsets from the frequent itemsets, instead of the RF itemsets. Furthermore, when we use the frequent itemsets to generate the candidate set in the next phase, it still satisfies the anti-monotone property, because we use the same base to compute the supports for all itemsets.

As the first step of the algorithm, we build a table, called the PT table, for each item in I to associate the item with its context (i.e., the stores and times it is sold) and use the table to determine the context of an itemset. The algorithm proceeds in phases, where in the kth phase we generate Fk from Ck and RFk from Fk. In the first phase, we scan the database for the first time and build a two-dimensional table, called the TS table. In this table, the entry at the position corresponding to Ti and Pj, denoted by TS(Ti, Pj), records the number of transactions that occur at store Pj in period Ti. Using this table and the PT table for a given itemset X, we can determine the number of transactions associated with the context of X, i.e., |DVX|. In the kth phase of the algorithm, we first derive Ck and then generate Fk by evaluating the supports, which can be done by scanning the database and removing all infrequent itemsets. Since an RF itemset must be a frequent itemset, we generate RFk from Fk by evaluating the relative supports of the itemsets X in Fk.

In the following subsections, we give detailed descriptions of the key elements of the algorithm, including methods of (1) building the PT table, (2) building the TS table in the first phase, (3) finding RFk, (4) generating candidate itemsets, and (5) generating the store-chain association rules.

3.1. The PT table

The purpose of the PT table is to store efficiently the time and store information for each product item in the database. We use a simple example to illustrate the procedure for constructing the table. Consider the bit matrices in Fig. 2 for items I1, I2, and I3, in which there are six stores and six selling periods, and "1" and "0" indicate, respectively, that the item is or is not for sale in the corresponding store and time. Because an item normally does not switch between on- and off-shelf very frequently in a typical application, we store an item's context information in the PT table instead of the bit matrix in order to conserve storage space. In the PT table, we need only to record
Fig. 4. The method to compute the jth row of the PT table for itemset X.
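Fig. 4 itself is not reproduced here, but the idea it implements, that an itemset's context is the intersection of its items' contexts (VX = VIk ∩ VIk' ∩ ...), can be sketched as follows. The bit matrices and (store, period) values are invented for illustration, and plain sets are used instead of the paper's interval-based PT table:

```python
# Illustrative sketch only: represent an item's context V_Ik as a set of
# (store, period) pairs derived from its bit matrix (Fig. 2 style), and
# compute an itemset's context as the intersection of its items' contexts.

def item_context(bit_matrix):
    """Convert a store-by-period 0/1 shelf matrix into a context set."""
    return {(p, t)
            for p, row in enumerate(bit_matrix, start=1)
            for t, on_shelf in enumerate(row, start=1)
            if on_shelf}

def itemset_context(contexts):
    """Context of an itemset: (store, period) pairs where ALL items are sold."""
    return set.intersection(*contexts)

# Two toy items over 3 stores x 4 periods (made-up data).
I1 = item_context([[1, 1, 1, 0],
                   [0, 1, 1, 1],
                   [0, 0, 0, 0]])
I2 = item_context([[0, 1, 1, 1],
                   [1, 1, 0, 0],
                   [1, 1, 1, 1]])

V_X = itemset_context([I1, I2])
print(sorted(V_X))   # [(1, 2), (1, 3), (2, 2)]
```

The real PT table stores each item's on-shelf intervals per store row, so the intersection is computed interval-by-interval rather than cell-by-cell, but the result is the same context set.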
example of the table is given in Fig. 5. Using the TS and PT tables for itemset X, we can determine the value |DVX| by summing all the values in the entries of the TS table according to the store and time information of the items in X. The process of constructing the table is described in lines 2 through 4 in Fig. 1.

3.3. Relative-frequent itemset

Because an RF itemset must be a frequent itemset, we can generate RFk from Fk by computing the relative supports of the itemsets X in Fk. It is evident that |W(X, DVX)| equals |W(X, D)|, because it is not possible for X to appear in a transaction not in DVX. Further, |DVX| can be obtained from the TS and PT tables of X. As a result, we can find the RF itemsets by first computing the relative supports of all X in Fk and then pruning those itemsets whose relative supports are less than σr.

3.4. Candidate itemsets

As discussed, we generate the candidate itemsets from the frequent itemsets, instead of the RF itemsets, of the last phase. Furthermore, when we use the frequent itemsets to generate the candidate set in the next phase, it still satisfies the anti-monotone property, because we use the same base to compute the supports for all itemsets. We illustrate the computation process by the following example.

Example 1. Suppose there are 15 periods, from T1 to T15, and the numbers of transactions occurring in these 15 periods are 19, 17, 14, 25, 20, 17, 15, 27, 21, 20, 22, 18, 25, 21, and 19, respectively. Assume that the selling periods of product A are from T1 to T10, and that there are 60 transactions containing product A. Furthermore, assume that the selling periods for product B are from T6 to T15, and that 80 transactions include product B. Finally, there are 50 transactions containing both products A and B, and they occur in periods T6 to T10.

In order to compute the supports and the relative supports for itemsets {A}, {B}, and {A, B}, we identify the following values: |W({A}, DV{A})| = |W({A}, D)| = 60, |W({B}, DV{B})| = |W({B}, D)| = 80, and |W({A, B}, DV{A,B})| = |W({A, B}, D)| = 50. Since the base for computing the support is |D| = 300, the supports for the three itemsets are given by sup({A}, D) = 60/300 = 0.2, sup({B}, D) = 80/300 = 0.267, and sup({A, B}, D) = 50/300 = 0.167, respectively. On the other hand, the bases for computing the relative support are |DV{A}| = 195, |DV{B}| = 205, and |DV{A,B}| = 100, respectively, for the three itemsets. As a result, the relative supports are rel_sup({A}, DV{A}) = 60/195 = 0.308, rel_sup({B}, DV{B}) = 80/205 = 0.39, and rel_sup({A, B}, DV{A,B}) = 50/100 = 0.5 for the itemsets.

Suppose we set σs at 0.1 and σr at 0.35. Then, we find that {A}, {B}, and {A, B} are all frequent. Furthermore, {A} is not relative-frequent, but {B} and {A, B} are relative-frequent.

3.5. The store-chain association rules

Having found the RF itemsets, we proceed to calculate the confidence values and to find all the SC association rules. As defined in Definition 3, the confidence value is given by conf(X ⇒ Y) = rel_sup(X∪Y, DVX∪Y)/rel_sup(X, DVX∪Y). If the confidence value exceeds σc, the SC association rule holds.

There is an issue that must be dealt with in computing the confidence value. In the calculation of rel_sup(X∪Y, DVX∪Y)/rel_sup(X, DVX∪Y), we obtain the numerator after the phase of processing X∪Y. But the denominator is still undetermined after that phase, because the length of X is smaller than that of X∪Y, and we process the itemsets of the same length in a single phase. One possible solution to this problem is to add one step after the phase of processing X∪Y. In this new step, we compute the support levels of all subsets of X∪Y under the context VX∪Y; i.e., the support levels of X in database DVX∪Y, where X is a subset of X∪Y.

Fig. 5. An example of the TS table.
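The arithmetic of Example 1 can be checked with a short script; every number below is taken from the example itself, with the context bases |DVX| obtained, as in the paper, by summing per-period transaction counts over each itemset's selling periods:

```python
# Reproducing the arithmetic of Example 1 (all numbers from the text).
counts = [19, 17, 14, 25, 20, 17, 15, 27, 21, 20, 22, 18, 25, 21, 19]  # T1..T15
D = sum(counts)                      # |D| = 300

# Context sizes |D_VX|: sum the TS entries over each itemset's selling periods.
base_A  = sum(counts[0:10])          # A sold in T1..T10  -> 195
base_B  = sum(counts[5:15])          # B sold in T6..T15  -> 205
base_AB = sum(counts[5:10])          # common periods T6..T10 -> 100

sup_A, sup_B, sup_AB = 60 / D, 80 / D, 50 / D
rel_A, rel_B, rel_AB = 60 / base_A, 80 / base_B, 50 / base_AB

print(round(sup_A, 3), round(sup_B, 3), round(sup_AB, 3))   # 0.2 0.267 0.167
print(round(rel_A, 3), round(rel_B, 3), round(rel_AB, 3))   # 0.308 0.39 0.5

# With sigma_s = 0.1 and sigma_r = 0.35: all three itemsets are frequent,
# but only {B} and {A, B} are relative-frequent.
sigma_s, sigma_r = 0.1, 0.35
assert min(sup_A, sup_B, sup_AB) >= sigma_s
assert rel_A < sigma_r and rel_B >= sigma_r and rel_AB >= sigma_r
```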
If the RF itemsets produced in each phase needed another scan to produce the confidence values, the number of scans of the database in this algorithm would be twice that required by the Apriori algorithm. In order to reduce this requirement, we use another method: if Z is an RF itemset found in the kth phase, we compute rel_sup(X, DVZ) in the (k+1)th phase by "hitchhiking," where X is a subset of Z. In other words, in phase k+1, we perform two operations: the first is to find the RF itemsets of length k+1, and the second is to compute the relative supports, such as rel_sup(X, DVZ). All these values are calculated during the same scan of the database. Consequently, the proposed method requires only one more scan than the Apriori algorithm, which is needed to obtain the confidence values after the RF itemsets of the last phase are produced. This process is included as line 11 in the algorithm, and in Fig. 6, we give the process of computing rel_sup(X, DVZ) for all subsets X of Z.

In order to compute rel_sup(X, DVZ) for all subsets X of Z, we must enumerate all the subsets, X, of each RF itemset, Z, of the previous phase; if the length of the RF itemset is k, the number of subsets is 2^k − 2. Because each Z has its own PT table (built in line 8 of Fig. 1), every time a transaction is read in, after computing the supports of the candidates in Ck, we need to check whether the transaction falls within the PT tables of the RF itemsets Z. If not, the transaction does not happen under the context VZ, and it can be ignored. On the other hand, if the answer is positive, the transaction happens under the context VZ, and, as a result, we need to check whether the transaction includes any subset X of Z. This enables us to determine the support levels of all the subsets X of Z under the context VZ.

For example, suppose that two RF itemsets are generated in the third phase: {A, B, C} and {C, D, E}. In the fourth phase, we build the PT tables for all the RF itemsets in RF3. When a transaction is read, we need to check whether it includes any candidates in C4, as well as whether its time and store combination is in the contexts of {A, B, C} or {C, D, E}. If the time and store combination of the transaction does not conform to the context of {A, B, C}, we need to check whether it conforms to that of {C, D, E}. If it does, we proceed to check whether the transaction includes any subsets of {C, D, E}: {C}, {D}, {E}, {C, D}, {C, E}, and {D, E}. If it does, the counters of all the matching subsets are increased by one.

Finally, line 14 in Fig. 1 shows the step for generating the store-chain association rule X ⇒ Y, where X∪Y is in RFk−1. It is not difficult to compute the confidence of the rule, i.e., rel_sup(X∪Y, DVX∪Y)/rel_sup(X, DVX∪Y), because rel_sup(X∪Y, DVX∪Y) has already been found in the previous phase and rel_sup(X, DVX∪Y) is found in the current phase.

3.6. Complexity analysis

In this section, we analyze the time complexity and the memory space complexity of the algorithm. Let m be the number of items, n the number of transactions in the database, and l the number of items in a transaction. Further, let x denote the largest value of |Ck|. Note that, although |Ck| can theoretically be as large as O(m^k), |Ck| is very unlikely to be larger than O(m^2) in practice. This is because, in an Apriori-like algorithm [1,2,6,20], C2 usually has the largest size among all candidate sets. We discuss the time complexities of the steps of the Apriori_TP algorithm separately, as well as the total time complexity of the algorithm, as follows.

1. In step 1, we construct the PT table for each item. To produce the table for an item, its bit matrix with |P| rows and |T| columns needs to be linearly scanned and processed. Thus, the time for step 1 is O(m·|P|·|T|).
2. In steps 2 to 4, two operations are performed: (1) compute the supports of all itemsets in C1, and (2) construct the TS table. Since the first operation requires a linear scan of all the items in every transaction, its time is O(n·l). The second operation requires time O(n), because we examine the attached time and store identifier of each transaction. As a result, the total time for the three steps is O(n·l).
3. Step 5 is for determining F1 by examining the support of every itemset in C1. Since C1 has n itemsets, the time needed for the step is O(n).
4. There is a loop from steps 6 to 18. The time complexities of the steps in iteration k of the loop are discussed as follows.
4.1. In step 7, we generate Ck. Consequently, the required time is O(|Ck|). Because we assume O(|Ck|) ≤ O(x), the time is O(x).
4.2. In step 8, we build a PT table for each itemset z in RFk−1. We need k−2 merging operations for this step, because the k−1 PT tables of the individual items in z need to be merged. Because each merging operation can be done in time O(|P|·|T|), the total time for step 8 is O(|RFk−1|·k·|P|·|T|). Since RFk−1 ⊆ Ck−1, we have O(|RFk−1|) ≤ O(x), and the total time becomes O(x·k·|P|·|T|).
4.3. In steps 9 to 11, there are two tasks: (1) compute the supports of all itemsets in Ck, and (2) compute the supports of all subsets of itemsets in RFk−1. The time required for the first task is O(n·l·|Ck|), because it can be done by first reading every transaction and then adding the counts to the corresponding itemsets. In the second task, we add the counts to all subsets of itemsets in RFk−1 rather than to all itemsets in RFk−1. Therefore, it can be done by first reading every transaction and every itemset in RFk−1, generating all subsets, and finally adding the counts. Performing these operations requires time O(n·l·|RFk−1|·2^(k−1)). Since O(|RFk−1|) ≤ O(|Ck−1|), the time required for this part is O(n·l·x·2^(k−1)).
4.4. Step 12 is used to generate Fk from Ck. Since the support of each itemset in Ck must be checked to be no less than σs, the time is O(|Ck|) = O(x).
4.5. Step 13 is for generating RFk from Fk. Because the relative support of each itemset in Fk must be checked to be no less than σr, the time is O(|Fk|). Since O(|Fk|) ≤ O(|Ck|), we have the total time O(x).
4.6. In steps 14 through 17, we compute the confidence of x ⇒ y, where x∪y is in RFk−1. That means, for each z = x∪y in RFk−1, we need to check all of its subsets. Therefore, there are in total |RFk−1|·2^(k−1) possible combinations. Since each combination needs a simple division, the total time for this part is O(|RFk−1|·2^(k−1)). Furthermore, because O(|RFk−1|) ≤ O(|Ck−1|), the total time required is O(x·2^(k−1)).

From the above analysis, we know that two parts of the algorithm are the most time consuming. The first is step 8, and the second is steps 9 through 11, which require times O(x·k·|P|·|T|) and O(n·l·x·2^(k−1)), respectively. Let K denote the total number of phases in the loop from step 6 to step 17. Then the total time is O(x·K^2·|P|·|T|) + O(n·l·x·K·2^K).

Next, we analyze the memory space required for the algorithm. We perform the analysis by examining the space needed to store the data structures used in the algorithm.

1. Because the space requirement for the PT-Interval table for each item is O(|P|·|T|), the total requirement for all individual items is O(m·|P|·|T|).
2. The requirement for the PT-Interval table for each itemset in RFk−1 is O(|RFk−1|·|P|·|T|).
4.1. Data generation

In the experiment, we randomly generate the synthetic transactional data sets by applying the data generation algorithm proposed by Agrawal and Srikant [1]. The factors considered in the simulation are listed in Table 1. In addition, we generate the time and store information for each transaction in the data sets.

To generate the store sizes, we use two parameters, Su and Sl, to represent the largest and smallest store

Ni = (ISi / Max(ISi)) × r

Note that the products sold in a store may change over time, although Ni is kept the same in all periods. Since the parameter Id is the proportion of products that are replaced in every period, store i replaces Ni × Id products in each period. Furthermore, we follow the method used by Agrawal and Srikant [1] to generate Fd maximum potentially frequent itemsets with an average length of Fl.
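The store-size step can be sketched as below. Note the equation above is our reading of a garbled formula in the source (Ni = ISi / Max(ISi) × r); the draw of ISi between Sl and Su, the store count q, and all parameter values here are illustrative assumptions, not the paper's exact procedure:

```python
import random

# Illustrative sketch of the store-size step, assuming the reconstructed
# equation N_i = IS_i / max(IS_i) * r: each store draws a raw size IS_i
# between the smallest and largest store parameters, and scaling by the
# maximum draw yields its item count N_i out of the r available items.
random.seed(7)
r, q = 1000, 10            # r items (from the paper); q stores (illustrative)
S_l, S_u = 200, 1000       # smallest/largest raw store size (illustrative)

IS = [random.uniform(S_l, S_u) for _ in range(q)]
N = [int(is_i / max(IS) * r) for is_i in IS]
print(max(N) == r)         # True: the biggest store carries all r items

# Each period, store i then replaces N_i * I_d of its products
# (I_d is the replacement proportion from Table 1; 0.1 is illustrative).
I_d = 0.1
replaced = [int(n * I_d) for n in N]
```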
Finally, we generate all the transactions in the data sets. To generate the transactions for store i in period j, we generate Dij from a Poisson distribution with mean L and a series of maximum potentially frequent itemsets. If an itemset generated from the process has some items not sold at store i in period j, we remove these items and repetitively add items into the transaction until we have reached the intended size. If the last itemset exceeds the boundary of the transaction, we remove the part that exceeds the boundary. When adding an itemset to a transaction, we use a "corruption level," c = 0.7, to simulate the phenomenon that all the items in a frequent itemset do not always appear together. Information on how the corruption level affects the procedure of generating items for a transaction is included in the paper by Agrawal and Srikant [1]. To generate the nine types of data sets shown in Table 2, we use the following parameter values: r = 1000, D = 100 K, L = 6, Fl = 4, and Fd = 1000. For each type of the data sets, 10 replications are generated for statistical analysis of the results.

4.2. Performance measures

As discussed in Section 2, the traditional method underestimates the support and the confidence values and, as a result, may fail to identify important purchasing patterns in a multiple-store environment. We define three measures (errors) for empirically assessing the magnitudes of the deviations in support, confidence, and the number of association rules when we use the traditional association rules for the store-chain data.

The type A error measures the relative difference in the support levels of all frequent itemsets generated by the traditional and proposed methods. It is determined by (rel_sup(X, DVX) − sup(X, D))/rel_sup(X, DVX). For example, if the support and relative support for an itemset X are sup(X, D) = 0.02 and rel_sup(X, DVX) = 0.03, respectively, then the type A error rate is (rel_sup(X, DVX) − sup(X, D))/rel_sup(X, DVX) = 33.33%. By averaging the error rates of all frequent itemsets, we obtain the overall type A error rate. Similarly, the type B error is used to compare the difference in the confidence levels of all rules generated by the traditional and proposed methods. It is defined as (conf(X ⇒ Y) − conf′(X ⇒ Y))/conf(X ⇒ Y), where conf′(X ⇒ Y) is the rule confidence computed by the traditional methods. By averaging the type B error rates of all common rules in the two methods, we obtain the overall type B error rate. Finally, the type C error is used to compare the relative difference in the numbers of rules generated by the two methods.

Note that we set σs and σr at the same level when evaluating the type A and type B error rates. This is because the frequent itemsets found by the two algorithms have to be the same in order to have a common base for comparing the results produced by the two algorithms. Furthermore, we set σc at 1% in the comparison based on the type B error. Using this low value, we can include almost all possible rules in the comparison. However, because in a practical situation the minimum confidence threshold could be higher than this value, we also obtain results for selected minimum confidence values ranging from 40% to 60%. Finally, we set σs at 0.5% in the comparison based on the type C error.
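The two relative-deviation measures, including the worked example from the text (sup = 0.02, rel_sup = 0.03), amount to:

```python
def type_a_error(rel_sup, sup):
    """Type A: relative support deviation, (rel_sup - sup) / rel_sup."""
    return (rel_sup - sup) / rel_sup

def type_b_error(conf, conf_traditional):
    """Type B: relative confidence deviation between the two methods."""
    return (conf - conf_traditional) / conf

# The example from the text: sup(X, D) = 0.02, rel_sup(X, D_VX) = 0.03.
print(f"{type_a_error(0.03, 0.02):.2%}")   # 33.33%
```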
Fig. 7. (a) Effects of the numbers of stores and periods on the type A error rate. (b) Effects of the numbers of stores and periods on the type B error rate. (c) Effects of the numbers of stores and periods on the type C error rate.
on Knowledge Discovery in Databases, Edmonton, Canada, 2003, July.
[19] H. Lu, L. Feng, J. Han, Beyond intra-transaction association analysis: mining multi-dimensional inter-transaction association rules, ACM Transactions on Information Systems 18 (4) (2000) 423–454.
[20] J.-S. Park, M.-S. Chen, P.S. Yu, Using a hash-based method with transaction trimming for mining association rules, IEEE Transactions on Knowledge and Data Engineering 9 (1997) 813–825.
[21] R. Rastogi, K. Shim, Mining optimized association rules with categorical and numeric attributes, IEEE Transactions on Knowledge and Data Engineering 14 (2002) 29–50.
[22] J.F. Roddick, M. Spiliopoulou, A survey of temporal knowledge discovery paradigms and methods, IEEE Transactions on Knowledge and Data Engineering 14 (2002) 750–767.
[23] S. Shekhar, S. Chawla, S. Ravadam, A. Fetterer, X. Liu, C. Lu, Spatial databases—accomplishments and needs, IEEE Transactions on Knowledge and Data Engineering 11 (1999) 45–55.
[24] R. Srikant, R. Agrawal, Mining quantitative association rules in large relational tables, Proceedings of the ACM-SIGMOD 1996 Conference on Management of Data, Montreal, Canada, 1996, pp. 1–12, June.
[25] J. Wijsen, R. Meersman, On the complexity of mining quantitative association rules, Data Mining and Knowledge Discovery 2 (1998) 263–281.

Kwei Tang is Professor of Management and the area coordinator of quantitative methods in the Krannert Graduate School of Management at Purdue University. He received a BS from National Chiao Tung University, Taiwan, an MS from Bowling Green State University, and a PhD in Management Science from Purdue University. His current research interests include data mining, supply chain management, and quality management.

Ren-Jie Shen is a system analyst and designer at Data Systems Consulting, a leading commercial software company in Taiwan. He received his MS degree in Information Management from National Central University of Taiwan. His research interests include data mining, information systems, and EC technologies.

Ya-Han Hu is currently a PhD student in the Department of Information Management, National Central University, Taiwan. He received his MS degree in Information Management from National Central University of Taiwan. His research interests include data mining, information systems, and EC technologies.