KDD 98

The paper introduces a new indexing structure called Group Bitmap Index to enhance the retrieval of association rules and item sets in relational databases, addressing the limitations of traditional indexing methods like B+ trees and bitmap indexes. Experimental results demonstrate that the Group Bitmap Index significantly improves subset search performance, which is crucial for mining association rules from large datasets. The paper also discusses two types of group bitmap indexes: simple and hash group bitmap indexes, detailing their construction and application in efficient data retrieval.


Group Bitmap Index: A Structure for Association Rules Retrieval

Tadeusz Morzy, Maciej Zakrzewicz

Institute of Computing Science


Poznan University of Technology
ul. Piotrowo 3a, 60-695 Poznan, Poland
[email protected], [email protected]

Copyright © 1998, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract
Discovery of association rules from large databases of item sets is an important data mining problem. Association rules are usually stored in relational databases for future use in decision support systems. In this paper, the problem of association rules retrieval and item sets retrieval is recognized as the subset search problem in relational databases. The subset search is not well supported by the SQL query language and traditional database indexing techniques.
We introduce a new index structure, called Group Bitmap Index, and compare its performance with traditional indexing methods: B+ tree and bitmap indexes. We show experimentally that the proposed index enables faster subset search and significantly outperforms traditional indexing methods.

Introduction
Knowledge discovery in databases (KDD) is a new area of database research that aims at finding previously unknown and potentially useful patterns in large databases (Fayyad, Piatetsky-Shapiro, and Smyth 1996). The most commonly sought patterns are association rules (Agrawal, Imielinski, and Swami 1993), (Piatetsky-Shapiro and Frawley 1991), (Agrawal and Srikant 1994), (Savasere, Omiecinski, and Navathe 1995), (Toivonen 1996). Formally, by an association rule we mean a formula of the form X→Y, where X and Y are two sets of items. Association rules are discovered from database tables that store sets of items. Consider a supermarket database where the set of items purchased by a customer on a single visit to a store is recorded as a transaction. The supermarket managers might be interested in finding associations among the items purchased together in one transaction. An example of a supermarket database and a set of association rules derived from the database are presented in Fig. 1. The example discovered rule: bread ∧ butter ∧ milk → apples states that a customer who purchases bread, butter and milk probably also purchases apples. We refer to the left-hand side of the rule as body, and to the right-hand side as head. We also say that the rule is satisfied by a given item set (item set satisfies the rule) if X∪Y is contained in the set. We say that the rule is violated by a given item set (item set violates the rule) if the set contains X, but does not contain Y. Each rule has two measures of its statistical importance and strength: support and confidence. The support of the rule is the number of item sets that satisfy the rule divided by the number of all item sets. The rule confidence is the number of item sets that satisfy the rule divided by the number of item sets that contain X.

transaction_id   items
1                bread, butter
2                bread, butter, milk, apples
3                bread, butter, milk, apples

bread → butter
bread ∧ butter ∧ milk → apples

Fig 1. Example of a database and discovered rules

Usually, the discovered rules are stored in a database for future retrieval by users or decision support systems. The users may iteratively explore the set of rules discovered from the given database from many points of view. Moreover, users might be interested in finding customer transactions that e.g. violate a given association rule. Such queries, often referred to as relational division (Graefe and Cole 1995), require analysis of set-valued attributes and are not supported by popular DBMSs.
In this paper we consider retrieval of association rules and item sets from a relational database. We generalize both retrieval problems to the subset search problem and introduce a new index structure, called group bitmap index. We show the results of the experiment in which the group bitmap index significantly outperforms the traditional indexing methods, namely B+ tree and bitmap indexes.

Storage Structures
The example data from Fig. 1 can be stored in a database table with two attributes: one referring to item values and one organizing items into sets or transactions. Each item is stored in a separate record and the item set
may consist of many records. Such a structure allows efficient storage of variable length sets of items. An example purchase data table is depicted in Fig. 2.

shopping
transaction_id   item
1                bread
1                butter
2                bread
2                butter
2                milk
2                apples
3                bread
3                butter
3                milk
3                apples

Fig. 2 Table for storing supermarket purchase data

Association rules that are mined in a knowledge discovery process can also be stored in database tables. Fig. 3 presents an example relational representation for association rules storage. The rule bodies and heads are placed in a separate table, while the second table keeps specific rule parameters (e.g. support and confidence). In this example, two rules from Fig. 1 are represented.

rules
rule_id   supp.   conf.
1         0.83    0.90
2         0.25    0.13

elements
rule_id   item     type
1         bread    body
1         butter   head
2         bread    body
2         butter   body
2         milk     body
2         apples   head

Fig. 3 Example tables for rule storage

Notice the similarity between the representation of data item sets and the representation of rule items. Both representations are based on storage of specific items together with an identifier of the set the items belong to. For the sake of simplicity, we will assume further in the paper that both items and set identifiers are positive integer numbers.

Queries
We distinguish two basic types of queries that are usually issued in mining and searching of association rules:
A. Retrieve all item sets that contain a given subset of items (to determine the sets that satisfy or violate the specified rules).
B. Retrieve all rules that contain a given subset of items in their bodies or heads.
Notice that both types of queries consist in finding the sets of items that contain a given item subset. A similar problem was studied in (Graefe and Cole 1995), where the relational division operator was described. We will refer to this type of query as a subset search query. It is a set-oriented query that is not well supported by the SQL interface and traditional database accessing methods. The SQL language does not contain a subset search (or relational division) clause; therefore, to specify a subset search query in SQL, aggregation or multiple join clauses are required. Traditional accessing methods (B+ tree, bitmap index, etc.) are row-oriented, i.e. they reference single records. Therefore, subset selection requires multiple use of the index. To illustrate the subset search query, we present below two examples of SQL queries retrieving from a database table data_table the identifiers of data item sets containing four given items 0, 7, 12, and 13.

Table: data_table
group_id   item
--------   ----
1          0
1          7
1          12
1          13
2          2
2          4
3          10
3          17
3          20

1. select a.group_id
   from data_table a, data_table b,
        data_table c, data_table d
   where a.group_id = b.group_id
     and b.group_id = c.group_id
     and c.group_id = d.group_id
     and a.item = 0 and b.item = 7
     and c.item = 12 and d.item = 13

2. select group_id from data_table
   where item in (0, 7, 12, 13)
   group by group_id
   having count(*) = 4

In general (and we will show it experimentally), finding data item sets that contain a given subset is a complex and time-consuming task. Therefore, to provide faster subset searching we propose a new indexing technique.

Group Bitmap Index
In this section we explain the idea of group bitmap indexing. The aim of group bitmap indexing is to enable faster subset search and content-based association rules retrieval in relational databases. The key idea of the group bitmap index is to build a binary key, called group bitmap key, associated with each item set. The group bitmap key represents the content of the item set. During retrieval, the group bitmap keys are used to prune those sets of items that do not contain the searched subset.
In the following subsections we will introduce two types of group bitmap indexes. The first one, called simple group bitmap index, will help us in explaining the basic ideas of the group bitmap index construction and utilization. Due to its properties the simple group bitmap index has a rather theoretical character. Therefore, in the next subsection, we will present a modified index structure, called hash group bitmap index, that efficiently supports subset search queries and may be easily implemented in practice.

Simple Group Bitmap Index
The simple group bitmap index consists of a set of simple group bitmap keys. The simple group bitmap key for a set of items is a binary number, in which the bit value ‘1’ on the position k indicates that the set contains the value k. The simple group bitmap keys are stored in an index table together with identifiers of the sets they refer to. When a
subset search query is issued, a simple group bitmap key is also computed for the searched subset of items. Then, the subset containment is checked by means of a fast bitwise machine operation, i.e. bitwise AND. The checking procedure consists in testing whether, for every bit set to ‘1’ in the simple group bitmap key of the searched subset, the corresponding bit of the simple group bitmap key of an item set is also set to ‘1’. The item sets whose simple group bitmap keys satisfy the testing condition are returned as the result of the subset searching. Figure 4 illustrates an example database table and the simple group bitmap index construction. Three simple group bitmap keys were derived for item sets stored in the database table. The derived simple group bitmap keys, together with corresponding item set identifiers, are stored in the index table.

database table      simple group bitmap index
group   item        bitmap key           group
1       0           000011000010000001   1
1       7           000000000000010100   2
1       12          101000010000000000   3
1       13
2       2
2       4
3       10
3       15
3       17

Fig. 4 Simple group bitmap index

When a subset search query seeking item sets containing e.g. items 15 and 17 is issued, the simple group bitmap key for the searched subset is computed. This key contains ‘1’s on positions 15 and 17. In the next step, by means of a bitwise AND, the index table is scanned for keys containing ‘1’s on the same positions. As the result, the item set identified by group=3 is returned.
Let us notice that the simple group bitmap keys have to be N bits long, where N denotes the number of all possible items. In practice, N can be of the order of hundreds or thousands. This results in very long, space-consuming simple group bitmap keys that are difficult to store as well as to process. Moreover, since a database is dynamic in nature, meaning that the number of possible items may change over time, the length of the simple group bitmap keys should
change accordingly. The maintenance of such an index would be costly and difficult. Therefore, the simple group bitmap index has a rather theoretical character. However, in the case of a static database with a small number of possible items it could be applicable.

Hash Group Bitmap Index
To eliminate the discussed disadvantages of the simple group bitmap index, we introduce the hash group bitmap index. This type of group bitmap index operates on hash group bitmap keys, whose length is n, where n<<N. The hash group bitmap key of an item set is created from the hash keys of all data items contained in a given item set, by means of the bitwise OR operation. The hash key of the item X is an n-bit binary number defined as follows:

hash_key(X) = 2^(X mod n)

The subset search with the hash group bitmap index is performed as a two-step procedure. The first step, called the filtering step, consists in scanning the index and finding the identifiers of item sets that possibly contain the searched subset. It is done as follows. When a subset search query is issued, a hash group bitmap key is computed for the searched subset of items. Then, the hash group bitmap index is scanned and each hash group bitmap key is checked against the searched subset key. The checking procedure is performed by means of the bitwise AND machine operation, according to the following pseudo code:

X := hash_group_bitmap_key(searched_subset);
for each (bitmap_key, group) from index table do
    if ((X AND bitmap_key) = X) then return group;

The identifiers of the item sets whose hash group bitmap keys satisfy the testing condition are returned as the result of the first step of the subset search procedure. However, because hash keys do not uniquely represent items, the result of this step may also contain false sets of items, i.e. sets that in fact do not contain the searched subset. Therefore, verification of the obtained result is necessary. This is the aim of the second step, called the verification step. It simply consists in selecting, by traditional means, from the item sets found in the previous step those item sets that contain the searched subset.

database table      hash keys   hash group bitmap keys
group   item
1       0           00001
1       7           00100
1       12          00100
1       13          01000       01101
2       2           00100
2       4           10000       10100
3       10          00001
3       15          00001
3       17          00100       00101

hash group bitmap index
bitmap key   group
01101        1
10100        2
00101        3

Fig. 5 Hash group bitmap indexing

searched subset of items   hash keys
15                         00001
17                         00100
hash group bitmap key      00101

result of scanning the hash group bitmap index (AND with 00101)
bitmap key   group
01101        1
00101        3

verify item sets: 1, 3

Fig. 6 Item sets retrieval with hash group bitmap index

To illustrate the construction of a hash group bitmap
index and its application to the subset search problem, let us consider the examples presented in Fig. 5 and Fig. 6. In the example in Fig. 5, three hash group bitmap keys were derived for item sets stored in the database table. When a subset search query seeking item sets containing e.g. items 15 and 17 is issued, the hash group bitmap key for the searched subset is computed (Fig. 6). Then, by means of a bitwise AND, the index table is scanned for keys containing ‘1’s on the same positions. As the result of the first step of the subset search procedure, the item sets identified by group=1 and group=3 are returned. Then, in the verification step, these item sets are tested for the containment of the items 15 and 17. Finally, the item set identified by group=3 is the result of the subset search.

Experimental Results
The hash group bitmap indexing has been implemented on top of an Oracle 7.3.2 RDBMS (Sun SPARCserver 630MP, 128 MB RAM). Experimental data sets were created by the GEN generator from the Quest project (Agrawal et al. 1996). Several parameters shown in Table 1 affect the distribution of the synthetic data.

parameter   description                         value
ntrans      number of item sets                 50,000
nitems      number of different items           100 to 500
tlen        average items per set               15 to 30
npats       number of patterns                  500 and 10000
patlen      average length of maximal pattern   4
corr        correlation between patterns        0.25

Table 1 Synthetic data parameters

Fig. 7 shows the performance of traditional and hash group bitmap indexing methods for different sizes of a searched subset. Traditional B+ tree and bitmap indexes show a rapid increase of query execution time for increasing searched subset size. The results of the search with the hash group bitmap index do not significantly depend on the size of the searched subset. The cross-over between the hash group bitmap indexing and traditional bitmap indexing (which appeared the best of the traditional methods) occurs when a subset of 5 (for 24-bit group bitmap index) or 6 items (for 16-bit group bitmap index) is searched. For the searched subset size of 10 items, the hash group bitmap index allows data retrieval twice as fast as the traditional bitmap index.

Fig. 7 Query execution time vs. searched subset size
Fig. 8 Query execution time vs. average item set size
Fig. 9 Query execution time vs. number of items
Fig. 10 Effectiveness of the filtering step

The above experiment also showed that the best of
traditional database accessing methods is the one with the bitmap index. Therefore, in the subsequent experiments, we restrict our attention to the bitmap index, compared to our hash group bitmap index.
Fig. 8 demonstrates the effect of the average size of item sets on the performance of the hash group bitmap index and the bitmap index. As expected, the larger the average size of the sets, the longer the hash group bitmap key that should be used. When the average size is greater than the hash group bitmap key length, the hash group bitmap index does not improve the query execution. The conclusion is that the hash group bitmap key length should be slightly greater than the average item set size.
Fig. 9 demonstrates the effect of the number of items stored in the database on the performance of the hash group bitmap index and the bitmap index. For an increasing number of items, the performance of the hash group bitmap index gets relatively worse compared to the traditional bitmap index. The explanation of this behavior is that more items are hashed to the same bits in the hash group bitmap key, thus increasing the number of false sets returned after the first step of the retrieval procedure. The false sets are removed from the result during the verification step, which therefore consumes more time. The conclusion is that the hash group bitmap key length should be greater for data sets having more items.
Fig. 10 illustrates the effectiveness of the filtering step of the subset search procedure. This effectiveness is measured by the percentage of item sets pruned at this step. As can be seen, the percentage of pruned sets increases with the size of the searched subset. For example, for searched subsets of size 4-10 and a 24-bit hash group bitmap index, over 95% of all item sets are pruned. In addition, the difference between the effectiveness of the 16-bit and 24-bit hash group bitmap indexes is demonstrated. For the same data set, the 24-bit index provides significantly better pruning than the 16-bit index. The conclusion is that pruning is performed better by hash group bitmap indexes with longer hash group bitmap keys. The explanation of this behavior results from the idea of the hash group bitmap index. It can be shown that the number of bits set to ‘1’ in an n-bit hash group bitmap key for an item set of size L is given by the following formula:

[formula not legible in the source]

where n is the number of bits of the hash group bitmap key and L is the number of items represented by the hash group bitmap key. With increasing L, for constant n, the number of items hashed to the same bits increases (e.g. for L = 4 the number of bits set to ‘1’ is 4, while for L = 16 it is 10, which means that six of the ten bits have to represent two different items each). Similar behavior could also be observed for an increasing size of the searched subset. So, it is clear that for n=16 the selectivity of our hash group bitmap key is lower than for the hash group bitmap key of n=24.

Conclusions and Future Work
In this paper we introduced a new index type, called group bitmap index, which significantly reduces the time of subset searching in large databases. This kind of searching has many applications in the field of data mining and association rules discovery. However, the new index may also be applied in traditional database systems to speed up the execution of queries seeking a subset of data items. We showed experimentally that the group bitmap index significantly outperforms traditional indexing methods including B+ tree and bitmap indexing. The experiment was conducted on top of a DBMS, using the standard SQL interface; therefore we believe that the results would be even better for a group bitmap index integrated into the core of a DBMS.

References
Agrawal, R.; Imielinski, T.; Swami, A. 1993. Mining Association Rules Between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, DC.

Agrawal, R.; Srikant, R. 1994. Fast Algorithms for Mining Association Rules. In Proceedings of the 20th VLDB Conference, Santiago, Chile.

Fayyad, U.; Piatetsky-Shapiro, G.; Smyth, P. 1996. The KDD Process for Extracting Useful Knowledge from Volumes of Data. Communications of the ACM, Vol. 39, No. 11.

Imielinski, T.; Mannila, H. 1996. A Database Perspective on Knowledge Discovery. Communications of the ACM, Vol. 39, No. 11.

Piatetsky-Shapiro, G.; Frawley, W.J. editors 1991. Knowledge Discovery in Databases. MIT Press.

Savasere, A.; Omiecinski, E.; Navathe, S. 1995. An Efficient Algorithm for Mining Association Rules in Large Databases. In Proceedings of the 21st VLDB Conference, Zurich, Switzerland.

Toivonen, H. 1996. Sampling Large Databases for Association Rules. In Proceedings of the 22nd VLDB Conference, India.

Graefe, G.; Cole, R.L. 1995. Fast Algorithms for Universal Quantification in Large Databases. ACM Transactions on Database Systems, Vol. 20, No. 2.

Agrawal, R.; Mehta, M.; Shafer, J.; Srikant, R.; Arning, A.; Bollinger, T. 1996. The Quest Data Mining System. In Proceedings of the 2nd International Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon.
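
The key construction and the two-step search described in the Simple Group Bitmap Index and Hash Group Bitmap Index sections can be sketched in Python as follows. This is only a minimal in-memory illustration, not the authors' implementation on top of Oracle: the function names (simple_key, hash_key, group_bitmap_key, build_index, subset_search) and the dictionary representation of the tables are assumptions introduced here. The item sets and the key length n = 5 are those of Fig. 4 to Fig. 6.

```python
from typing import Dict, Iterable, List, Set, Tuple


def simple_key(items: Iterable[int]) -> int:
    """Simple group bitmap key: bit k is set iff the set contains item k."""
    key = 0
    for item in items:
        key |= 1 << item
    return key


def hash_key(item: int, n: int) -> int:
    """Hash key of a single item: 2 ** (item mod n), an n-bit number."""
    return 1 << (item % n)


def group_bitmap_key(items: Iterable[int], n: int) -> int:
    """Hash group bitmap key: bitwise OR of the hash keys of all items."""
    key = 0
    for item in items:
        key |= hash_key(item, n)
    return key


def build_index(item_sets: Dict[int, Set[int]], n: int) -> Dict[int, int]:
    """Index table mapping each group identifier to its bitmap key."""
    return {g: group_bitmap_key(items, n) for g, items in item_sets.items()}


def subset_search(
    item_sets: Dict[int, Set[int]],
    index: Dict[int, int],
    subset: Set[int],
    n: int,
) -> Tuple[List[int], List[int]]:
    """Return (candidates after filtering, groups confirmed by verification)."""
    x = group_bitmap_key(subset, n)
    # Filtering step: keep groups whose key has every bit of x set.
    candidates = [g for g, key in index.items() if (key & x) == x]
    # Verification step: discard false sets by checking real containment.
    verified = [g for g in candidates if subset <= item_sets[g]]
    return candidates, verified


# Item sets of Fig. 4 and Fig. 5, with n = 5 as in Fig. 5.
item_sets = {1: {0, 7, 12, 13}, 2: {2, 4}, 3: {10, 15, 17}}
N_BITS = 5
index = build_index(item_sets, N_BITS)
candidates, verified = subset_search(item_sets, index, {15, 17}, N_BITS)

print({g: format(k, "05b") for g, k in index.items()})
print("candidates:", candidates, "verified:", verified)
```

With this data the filtering step keeps groups 1 and 3, and the verification step removes group 1 as a false set, leaving group 3, which agrees with the walk-through of Fig. 6.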

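
As a rough cross-check of the values quoted in the discussion of Fig. 10 (about 4 bits set for L = 4 and about 10 for L = 16), one can use the standard occupancy estimate below. This is not the paper's own formula. It assumes that each of the L items is hashed to one of the n bit positions uniformly and independently, and the symbol E[B] for the expected number of bits set to ‘1’ is introduced here only for illustration.

```latex
\[
  \mathbb{E}[B] \;=\; n\left(1 - \left(1 - \frac{1}{n}\right)^{L}\right)
\]
\[
  n = 16:\qquad
  L = 4 \;\Rightarrow\; \mathbb{E}[B] \approx 3.6,
  \qquad
  L = 16 \;\Rightarrow\; \mathbb{E}[B] \approx 10.3
\]
\[
  n = 24:\qquad
  L = 16 \;\Rightarrow\; \mathbb{E}[B] \approx 11.9
\]
```

Under this assumption a 16-bit key for 16 items has roughly 64% of its bits set, while a 24-bit key has roughly 49% set, which is in line with the observation that longer keys prune better.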