Algorithms of association rules extraction: State of the Art

Conference Paper · May 2011


DOI: 10.1109/ICCSN.2011.6014282



Algorithms of association rules extraction: State of the Art

AMDOUNI Hamida
PhD Student, Member of RIADI Laboratory,
FST, Tunisia.
[email protected]

GAMMOUDI Mohamed Mohsen
Associate Professor, Member of RIADI Laboratory,
ESSAI of Tunis, Tunisia.
[email protected]

Abstract—For more than a decade, the task of generating association rules has received considerable attention from researchers, because enterprise decision makers greatly need to be assisted by systems that take into account unknown knowledge extracted from huge volumes of data. In this paper, we present a survey of the best-known algorithms used for association rule extraction. We give a comparative study between them and show that they can be classified into several categories.

Keywords-Itemsets; Closed Itemsets; Frequent Itemsets; FCA; Association rules;

I. INTRODUCTION

The extraction of association rules is one of the most important techniques of data mining [1]. It consists of extracting unknown knowledge (patterns) from a large volume of data in order to help decision makers take efficient and profitable decisions.
Early approaches to extracting association rules are based on generating the frequent itemsets [2, 16, 8]. However, because of the considerable computing time of their extraction and the redundancy and irrelevance of the generated rules, a new approach was introduced [12]. It consists of extracting a subset of generic, non-redundant rules, without loss of information [3], based on the mathematical foundations of Formal Concept Analysis (FCA) [7].
This approach is based on extracting a subset of itemsets called closed itemsets, their minimal generators, and the relations between frequent itemsets. To our knowledge, only the algorithm Prince [9] presents the minimal generators according to the partial order relation. Not taking this relation into account leads to the generation of a large number of rules, as is the case for A-Close [12], Closet [13] and Charm [17].
The main goal of this paper is to present a state of the art of the best-known algorithms in the literature that allow the generation of association rules. Before presenting these algorithms, we recall the basic concepts necessary for their understanding.

II. BASIC NOTIONS

A. Extraction context of an associative rule
Let K = (O, I, R) be a triplet where O and I are respectively a set of objects (e.g. transactions) and a set of items, and R ⊆ O × I is a binary relation between objects and items.

B. Table of co-occurrences
It indicates, for each pair of items, their number of co-occurrences in the set of objects.

C. Itemset
A nonempty subset of items. An itemset consisting of k elements is called a k-itemset.

D. Support of an itemset
The frequency of simultaneous occurrence of an itemset I' in the set of objects, denoted Supp(I').

E. Frequent itemset (FI)
An FI is a set of items whose support is ≥ a user-specified threshold called minsup. All its subsets are also frequent. The set of all frequent itemsets is called SFI.

F. Maximal itemset
A frequent itemset all of whose supersets are non-frequent.

G. Associative rule
Any rule of the form A → B, where A and B are disjoint itemsets; A is the premise (condition) of the rule and B is its conclusion.

H. Confidence
The confidence of an association rule A → B measures how often the items of B appear in objects that contain A:
Confidence(R) = Supp(A ∪ B) / Supp(A)  (1)
- Supp(A ∪ B): the number of objects that contain both A and B.
- Supp(A): the number of objects that contain A.
Based on the degree of confidence, association rules can be classified as follows:
- Exact rule: a rule whose confidence = 1;
- Approximative rule: a rule whose confidence < 1;
- Valid rule: a rule whose confidence is ≥ a user-specified threshold called minconf.
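To make these notions concrete, the following sketch computes supports and confidences on a small extraction context; the transactions and item names are invented for illustration and do not come from the paper:

```python
# Toy extraction context K = (O, I, R): each object (transaction) is a set of items.
context = {
    1: {"bread", "milk"},
    2: {"bread", "butter"},
    3: {"bread", "milk", "butter"},
    4: {"milk"},
}

def supp(itemset):
    """Support of an itemset: number of objects containing all its items."""
    return sum(1 for items in context.values() if itemset <= items)

def confidence(premise, conclusion):
    """Confidence of the rule premise -> conclusion, as in equation (1)."""
    return supp(premise | conclusion) / supp(premise)

print(supp({"bread", "milk"}))            # 2
print(confidence({"bread"}, {"milk"}))    # 0.666... : approximative rule
print(confidence({"butter"}, {"bread"}))  # 1.0 : exact rule
```

With minconf = 0.5, all three rules above would be valid; with minconf = 0.8, only the exact rule butter → bread would remain.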
After going through these basic notions, we now present the different algorithms of associative rule extraction in chronological order, showing their advantages and disadvantages.

III. ALGORITHMS OF ASSOCIATIVE RULE EXTRACTION

In the literature, we can find two categories of algorithms: those which extract the associative rules from the frequent itemsets, and those which are based on Formal Concept Analysis (FCA) to generate a subset of the frequent itemsets called the frequent closed itemsets.

A. Algorithms based on the extraction of frequent itemsets

1) Apriori [2]
Apriori is one of the first algorithms of associative rule extraction. In what follows, we present its principle, its advantages and its disadvantages.
Its principle consists of two steps: the first finds the set of frequent itemsets (SFI) from the initial extraction context, and the second uses this set to determine the valid associative rules.

a) Frequent itemsets search
To find the set of frequent itemsets (SFI), Apriori proceeds iteratively, as follows:
- Find the set L1 of frequent itemsets of size one, and from it generate a set C2 of candidate itemsets of size two. Any element of C2 whose support is ≥ minsup becomes part of L2.
- In general, any element of Ck+1 is the union of two frequent itemsets of Lk sharing k-1 common elements. A generated itemset of size k+1 is deleted from Ck+1 if at least one of its subsets of size k does not belong to Lk.
- This process is repeated until Lk is empty. The result of this phase is the union of the different Lk determined.

b) Associative rules extraction
To extract the associative rules from the SFI found, Apriori proceeds iteratively, treating each Lk. For a frequent itemset IF ∈ Lk, the generated rules have the form IF − C → C (C: set of conclusive items). Any rule whose confidence, Supp(IF)/Supp(IF − C), is ≥ minconf is kept.
As a summary, the result generated by Apriori is clear and easy to interpret, but it presents several problems. The first is its theoretical complexity, O(mn2^m) [2], where n = |O| and m = |I|. The second is the high number of database accesses needed to extract the SFI, count the different supports and generate the rules, which are generally redundant and of little use. In fact, evaluating these rules requires the user's intervention, which is expensive.

In order to decrease the treatment time of the frequent itemset extraction process, [2] present a new version of Apriori, called AprioriTid. It is based on counting the candidates' supports by indexing the different transactions with their identifiers, called TIDs. We present its general principle, as well as its advantages and disadvantages, in the following part.

2) AprioriTid [2]
It uses the same principle as Apriori to generate the candidates, but counts their supports differently. First, it generates a set of candidates called C1 representing the database. This set includes elements (TID, {c1}), where {c1} is the list of itemsets of size one in the transaction TID. When k > 1, Ck is built from Ck-1. If an element of Ck has an empty list of k-itemsets for its TID, it is deleted. The support of an itemset in Ck is equal to its number of occurrences in Ck.
In the first iterations, the set of candidate itemsets can be huge, which causes a storage problem. However, when k increases, the number of elements in Ck becomes smaller than the number of transactions.

3) Partition [14]
The objective of this algorithm is to reduce the number of accesses to the initial context, in two steps. It divides the context into p partitions D1, ..., Dp. The frequent itemsets of each partition are determined during the first database access, and the SFI of the extraction context is obtained from the union of the frequent itemsets of the different partitions. During the second access, it counts the support of every element of this union.
As a summary, Partition makes only two accesses to the database, it does not process the candidate itemsets level by level, and the support is calculated using TID intersections, unlike the Apriori method.

4) Dic [5]
Dic divides the database into M blocks of transactions. After going through one block to determine the candidate k-itemsets and count their supports, it generates the (k+1)-itemsets from the frequent k-itemsets and starts to determine their supports.
Studying Dic, we observe that its number of database accesses is lower than Apriori's, as it processes candidates of different sizes simultaneously. However, the storage of the itemsets presents a problem, and the cost of the support counting is higher than the run time necessary for Apriori.

5) Eclat [16]
This algorithm uses the vertical format of the extraction context: each item is associated with a Tidset (the set of all transactions containing this item).
It searches the frequent itemsets of size one and two using the horizontal format of the database, and then uses a depth-first traversal based on the concept of equivalence classes (two k-itemsets belong to the same class if they share a common prefix of size k-1; each class is treated separately).
In summary, the vertical format adopted by Eclat simplifies the calculation of an itemset's support, since it is simply the size of the intersection of Tidsets. This also automatically reduces the size of the database, as only the transactions containing an itemset are used for the intersection.
In addition, this method can be parallelized, since each class can be treated separately to determine the frequent itemsets. The problem is that this method is effective for small databases, whereas the representation is not feasible in the case of large databases.
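The Tidset-based search described above can be sketched as follows; the context is illustrative, and the recursive routine is a simplified rendering of Eclat's class-by-class exploration, not the authors' implementation:

```python
# Vertical (Tidset) representation of a small illustrative context.
tidsets = {
    "a": {1, 2, 3},
    "b": {1, 3, 4},
    "c": {2, 3},
}

def supp(itemset):
    """Support via Tidset intersection: the size of the common Tidset."""
    return len(set.intersection(*(tidsets[i] for i in itemset)))

def eclat(prefix, items, minsup, out):
    """Depth-first exploration of prefix-based equivalence classes."""
    while items:
        item, tids = items.pop(0)
        if len(tids) >= minsup:
            out[tuple(sorted(prefix + [item]))] = len(tids)
            # Build the sub-class: intersect with the remaining Tidsets.
            eclat(prefix + [item], [(j, tids & tj) for j, tj in items],
                  minsup, out)
    return out

freq = eclat([], sorted(tidsets.items()), 2, {})
print(supp({"a", "b"}))      # 2
print(sorted(freq.items()))  # [(('a',), 3), (('a', 'b'), 2), (('a', 'c'), 2), (('b',), 3), (('c',), 2)]
```

Note how each recursive call only carries the Tidsets of its own class, which is what makes the class-wise treatment independent and parallelizable.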
6) FP-Growth [8]
To avoid repetitive context accesses, [8] suggest an algorithm called FP-Growth (Frequent Pattern Growth), allowing the extraction of the SFI without generating candidates.
It consists of compressing the database into a compact structure called the FP-tree, based on the notion of Trie [10], which is made of:
- A tree with a null root, whose nodes contain three pieces of information: the corresponding item, its frequency, and a pointer to the next node in the tree containing the same item.
- A list, called the index, containing the frequent items. Each item is associated with a pointer to the first node of the tree containing this item.
The construction of this structure requires two accesses to the extraction context:
- The first determines the frequent items and saves them in the index list, sorted in descending order of support.
- The second builds the FP-tree, knowing that the items of a transaction are stored according to their order in the index. A root is created, from which a branch is modeled for each transaction. A transaction is represented by a list of nodes containing the item, its frequency and a pointer to the next node (transactions with the same prefix are represented by the same branch, and identical transactions are represented only once).
After compressing the database, it is divided into sub-projections called conditional bases, each associated with a frequent item. The extraction of the frequent itemsets is done on each of these projections.
Using FP-Growth, the frequent items are sorted in decreasing order of support, implying that the most frequent items are near the root and are better shared by the transactions. The FP-tree is thus a compact and interesting representation. In addition, although the prefixes are shared by the transactions, the suffixes are not, which is why the number of nodes in the FP-tree is reduced. But it should be noted that there is a storage problem if the FP-tree size becomes important [6].

7) Conclusion
After presenting some algorithms for generating association rules based on the extraction of frequent itemsets, we can conclude that this method presents some advantages, such as an easier understanding of the calculation process adopted, clarity, and ease of interpretation of the result. But it is also clear that it does not provide satisfactory results in the case of large volumes of data. Moreover, it is very difficult to determine the number of items in each rule.

B. Algorithms of associative rule extraction using the FCA approach
This strategy has two steps: the first consists of extracting the frequent closed itemsets based on the mathematical foundations of Formal Concept Analysis (FCA) [7], and the second consists of developing a generic base including the informative rules, in order to provide useful and non-redundant rules.
In this part, we define some basic notions, then we present several algorithms which follow the FCA approach.

1) Basic notions
a) Galois connection: a formal context K is a triplet K = (O, I, R). For every set of objects A ⊆ O, the set f(A) of attributes in relation R with the objects of A is as follows:
f(A) = { i ∈ I | oRi ∀ o ∈ A }  (2)
Dually, for every set of attributes B ⊆ I, the set g(B) of objects in relation R with the attributes of B is as follows:
g(B) = { o ∈ O | oRi ∀ i ∈ B }  (3)
The two functions f and g defined between objects and attributes form a Galois connection. The operators f ∘ g(B) and g ∘ f(A), called φ, are the closure operators.
φ verifies the following properties, for all X, Y ⊆ I (resp. X1, Y1 ⊆ O):
- Idempotent: φ(φ(X)) = φ(X),  (4)
- Extensive: X ⊆ φ(X),  (5)
- Monotone: X ⊆ Y ⇒ φ(X) ⊆ φ(Y).  (6)

b) Frequent Closed Itemset (FCI): an itemset I' is called closed if I' = φ(I'). In other words, an itemset I' is closed if the intersection of the objects to which I' belongs is equal to I'; it is frequent if its support is ≥ minsup. SFCI is the Set of Frequent Closed Itemsets.

c) Minimal generator: an itemset c ⊆ I is a generator of a closed itemset I' ⇔ φ(c) = I'. c is a minimal frequent generator if its support is ≥ minsup and no proper subset of c has the same closure. The set of frequent minimal generators of I' is called GMFI':
GMFI' = { c ⊆ I | φ(c) = I' ∧ ∄ c1 ⊂ c such that φ(c1) = I' }  (7)

d) Negative border (GBd-): the set of the non-frequent minimal generators.

e) Positive border (GBd+): let GMFk be the set of all minimal frequent generators:
GBd+ = { c | Supp(c) ≥ minsup ∧ c ∉ GMFk ∧ ∀ c' ⊂ c, c' ∈ GMFk }  (8)

f) Equivalence classes: the closure operator φ divides the set of frequent itemsets into disjoint equivalence classes whose elements have the same support. The largest element of a given class is an FCI, denoted I', and the smallest ones are its minimal generators GMFI' [12].

g) Comparable equivalence classes: two classes Ci and Cj are said to be comparable only if the FCI of Ci covers that of Cj.
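The operators f, g and φ of equations (2)-(6) can be illustrated on a small context; the data is invented, and the functions are direct transcriptions of the definitions:

```python
# Toy formal context K = (O, I, R): object -> set of attributes (items).
objects = {
    1: {"a", "b"},
    2: {"a", "c"},
    3: {"a", "b", "c"},
}
items = {"a", "b", "c"}

def f(A):
    """f(A): the attributes shared by all objects of A (equation (2))."""
    return set.intersection(*(objects[o] for o in A)) if A else set(items)

def g(B):
    """g(B): the objects containing all attributes of B (equation (3))."""
    return {o for o, its in objects.items() if B <= its}

def phi(B):
    """Closure operator on itemsets: phi = f o g."""
    return f(g(B))

print(sorted(phi({"b"})))       # ['a', 'b'] : {b} is not closed, b always occurs with a
print(sorted(phi({"a", "b"})))  # ['a', 'b'] : closed itemset; {b} is its minimal generator
```

On this context, {b} and {a, b} have the same closure and the same support, so they belong to the same equivalence class, with {a, b} as its FCI and {b} as its minimal generator.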
The five following notions are defined in [7]:
h) Formal concept: a formal concept is a maximal objects-attributes subset whose objects and attributes are in relation. More formally, it is a pair (A, B), with A ⊆ O and B ⊆ I, which verifies f(A) = B and g(B) = A. A is the extent of the concept and B is its intent.

i) Partial order relation between concepts ≤: the partial order relation ≤ is defined as follows: for two formal concepts (A1, B1) and (A2, B2), (A1, B1) ≤ (A2, B2) ⇔ A2 ⊆ A1 and B1 ⊆ B2.

j) Meet / Join: for any two concepts (A1, B1) and (A2, B2), there exists a greatest lower bound (resp. a least upper bound), called Meet (resp. Join), denoted (A1, B1) ∧ (A2, B2) (resp. (A1, B1) ∨ (A2, B2)), defined by:
(A1, B1) ∧ (A2, B2) = (g(B1 ∩ B2), B1 ∩ B2)  (9)
(A1, B1) ∨ (A2, B2) = (A1 ∩ A2, f(A1 ∩ A2))  (10)

k) Galois lattice: the Galois lattice associated to a formal context K is a graph composed of the set of formal concepts equipped with the partial order relation ≤. This graph represents all the possible maximal correspondences between a subset of the objects O and a subset of the attributes I.

l) Frequent minimal generators lattice: a partially ordered structure in which each equivalence class includes the appropriate frequent minimal generators [4].

m) Iceberg lattice: a partially ordered structure of the frequent closed itemsets having only the Join operator. It is considered an upper semi-lattice [15].

n) Generic base of exact associative rules: a base composed of non-redundant generic rules having a confidence equal to 1, called GB [3]. Given a context (O, I, R), the set of frequent closed itemsets (SFCI) and the sets of minimal generators GMF:
GB = { R: g → (c − g) | c ∈ SFCI ∧ g ∈ GMFc ∧ g ≠ c }  (11)

o) Informative base of approximative associative rules: it is called IB and is defined as follows:
IB = { R: X → Y | Y ∈ SFCI ∧ φ(X) ⊂ Y ∧ Confidence(R) ≥ minconf ∧ Supp(Y) ≥ minsup }  (12)

2) Extracting frequent closed itemsets: the algorithms
After presenting these basic notions, we introduce a set of algorithms designed to extract the frequent closed itemsets.

a) Close [12]: this algorithm iterates over the search space to extract minimal generators, subsequently used to extract the associated frequent closed itemsets.
Each iteration involves two steps. The first is a self-join between the minimal generators found in the previous iteration, to form a set noted MGCk (k-Minimal Generator Candidates), whose elements are triplets (minimal generator, candidate closed itemset: CIC, support of the CIC). The second step prunes MGCk, eliminating the triplets whose closed-itemset support is < minsup. The valid rules are then generated using the frequent itemsets derived from the selected frequent closed itemsets.
It should be noted that this last step is costly and can generate many redundant and irrelevant rules.

b) A-Close [12]: to remedy the problem raised by the algorithm Close, [12] proposed a new algorithm called A-Close. It generates the frequent closed itemsets from the associated minimal generators.

c) Titanic [15]: the algorithm Titanic makes a breadth-first search, extracting at each level the frequent minimal generators and then deducing the frequent closed itemsets. To determine the k-generators, a self-join of the (k-1)-generators is carried out, and a given generator is deleted if one of its subsets is not frequent.
In addition, to reduce the computing time of the support of the k-generators, an estimated support is associated with each of them, equal to the minimum of the supports of the two (k-1)-generators that form it. In the case of the empty set, its estimated support is equal to the cardinality of the extraction context. Every k-generator whose support is lower than minsup, or whose real support is equal to its estimated support, is removed from the set. It is noteworthy that, in the worst case, Titanic performs a number of accesses to the extraction context equal to the maximum size of the lists of candidate generators.

d) Closet [13]: this algorithm has two stages. The first uses the FP-tree structure to represent the search space as a tree, eliminating the non-frequent itemsets. The second uses this structure to extract the frequent closed itemsets by dividing the tree into subtrees, called conditional sub-contexts, in order to explore the search space in depth. This implies that the algorithm requires two accesses to the extraction context: the first to extract the list of frequent 1-itemsets and the second to build the FP-tree.
The sub-contexts of the frequent 1-itemsets are processed first, in order of increasing support. Each one contains only the items that co-occur with the 1-itemset in question, called I', and that have a support above minsup. The frequent closed itemsets correspond to the concatenation of I' with all the items having the same support. The construction process of the sub-contexts continues recursively, knowing that an already treated item is excluded, since all the frequent closed itemsets containing it have already been generated. In addition, to reduce the extraction, the sub-context of an itemset is built only if the itemset is not covered by any frequent closed itemset already generated.
This algorithm has some disadvantages. The first is the processing cost of sorting the initial context by decreasing support. The second is the storage of a large number of sub-contexts during the recursive division of the initial context. In addition, checking whether a given itemset is included in one of the frequent closed itemsets found requires maintaining this list in memory throughout the treatment.
It should be noted that this algorithm does not manage a list of candidates; but in the case of a low support threshold and a sparse context, the number of itemsets included in the generated frequent closed itemsets is very small, which leads to building numerous sub-contexts.

e) Charm [17]: in order to reduce the number of accesses to the extraction context, as well as the number of candidates generated, for the extraction of the frequent closed itemsets, Charm uses both the set of closed itemsets and the identifiers of the transactions to which they belong (Tidsets). It therefore uses a structure called the IT-tree (Itemset-Tidset tree), where each node represents a pair of the form (candidate frequent closed itemset, Tidset).
First, the 1-itemsets are added to the structure by decreasing support. Subsequently, a traversal is made from the root, from left to right and in depth, to determine the frequent closed itemsets.
For every two closed itemset candidates I1 and I2 having the same parent, four cases are checked:
- If Tidset(I1) = Tidset(I2), then φ(I1) = φ(I2) = φ(I1 ∪ I2): all occurrences of I1 are replaced by (I1 ∪ I2), and I2 is removed from the tree, since its closure is the same as that of (I1 ∪ I2).
- If Tidset(I1) ⊂ Tidset(I2), then φ(I1) ≠ φ(I2) and φ(I1) = φ(I1 ∪ I2): all occurrences of I1 are replaced by (I1 ∪ I2), but I2 is not removed from the tree, because it can be a generator of other frequent closed itemsets.
- If Tidset(I1) ⊃ Tidset(I2), then φ(I1) ≠ φ(I2) and φ(I2) = φ(I1 ∪ I2): I2 is removed from the tree, and a node of the form ((I1 ∪ I2), Tidset(I1) ∩ Tidset(I2)) is added to the list of descendants of I1.
- If Tidset(I1) ≠ Tidset(I2), then φ(I1) ≠ φ(I2) ≠ φ(I1 ∪ I2): the occurrences of I1 and I2 do not change, and a node of the form ((I1 ∪ I2), Tidset(I1) ∩ Tidset(I2)) is added to the list of descendants of I1 if (I1 ∪ I2) is frequent; otherwise the tree remains the same.
It should be noted that Charm performs a single access to the extraction context, to determine the transaction lists of the itemsets. In addition, to reduce memory usage, it performs an incremental storage of the Tidsets using a data representation called Diffset, where the Diffset of a candidate is the difference between its Tidset and that of its immediate parent. This also reduces the processing time, since the intersection of two candidates I1 and I2 is nothing other than the result of the difference between Tidset(I1) and that of their parent P, on one hand, and between Tidset(I2) and that of P, on the other. But despite this, Charm is considered among the algorithms that consume a lot of memory.

f) Prince [9]: this algorithm extracts the minimal generators and builds a partially ordered structure called the frequent minimal generators lattice, in order to perform a vertical scan (bottom to top) to find the frequent closed itemsets and then extract the informative association rules, which are fewer, non-redundant and without loss of information.
First, Prince extracts the minimal generators of the initial context (∅ is the first generator to extract; its support is equal to the cardinality of the initial context). It determines all the k-generator candidates by performing, at each level k, a self-join of the (k-1)-generators. Subsequently, it eliminates any candidate of size k if at least one of its subsets is not a minimal generator, or if its support is equal to that of one of them. GMFk is the union of the sets of frequent minimal generators determined at each level, while the non-frequent ones form the GBd-.
Second, GMFk and GBd- are used to build the minimal generators lattice, by comparing each minimal generator g to the list L of immediate successors of its subsets of size k-1. If L is empty, then g is added to this list; otherwise, four cases are possible for each g1 ∈ L, where Cg and Cg1 are the equivalence classes of g and g1:
- If (g ∪ g1) is a minimal generator, then Cg and Cg1 are not comparable.
- If Supp(g) = Supp(g1) = Supp(g ∪ g1), then g and g1 belong to the same class.
- If Supp(g) < Supp(g1) = Supp(g ∪ g1), then g becomes the successor of g1.
- If Supp(g) < Supp(g1) ≠ Supp(g ∪ g1), then Cg and Cg1 are not comparable.
If (g ∪ g1) is not a minimal generator, the calculation of its support is performed by applying the following proposition [15]: let GMk = GMFk ∪ GBd- (the set of all generators); an itemset I' is a non-generator if:
Supp(I') = min{ Supp(gi) | gi ∈ GMk ∧ gi ⊂ I' }  (13)
The search for Supp(g ∪ g1) stops when one of its subsets has a strictly lower support than those of g and g1, because this implies that Cg and Cg1 are not comparable.
After constructing the minimal generators lattice, Prince determines, for each equivalence class, starting from C∅ towards the top, the frequent closed itemset, and builds the Iceberg lattice by applying the following proposition: let I1 and I2 be two frequent closed itemsets such that I1 covers I2 by the partial order relation, and let GMFI1 be the set of frequent minimal generators of I1:
I1 = (∪{ g | g ∈ GMFI1 }) ∪ I2  (14)
The two lattices are then used to extract the exact and approximative rules. Note that the rules with confidence = 1 are exact, and are implications extracted within each node (intra-node), whereas the approximative rules, with confidence ≥ minconf, are implications involving two comparable equivalence classes; these rules are implications between nodes.
As proved in [9], all the generated rules are non-redundant and guarantee that there is no loss of information. But it should be noted that the complexity of the first step (the extraction of the frequent minimal generators) is exponential, which implies a high overall processing time in the case of sparse contexts.
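As an illustration of the generic base GB of equation (11), the following sketch groups itemsets into equivalence classes by their closures, keeps the minimal generators, and emits one exact rule g → (c − g) per generator. The context is invented, and the exhaustive enumeration is only for demonstration; the algorithms above exist precisely to avoid it:

```python
from itertools import combinations

# Toy context: object -> set of items (illustrative data).
objects = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"a", "b", "c"}}
items = sorted({i for its in objects.values() for i in its})

def closure(X):
    """phi(X): intersection of all objects containing X."""
    objs = [its for its in objects.values() if set(X) <= its]
    return set.intersection(*objs) if objs else set(items)

# Group every nonempty itemset by its closure (brute force).
closed_to_gens = {}
for k in range(1, len(items) + 1):
    for X in combinations(items, k):
        closed_to_gens.setdefault(frozenset(closure(X)), []).append(set(X))

# One exact rule per minimal generator g of each closed itemset c.
rules = []
for c, gens in closed_to_gens.items():
    minimal = [g for g in gens if not any(h < g for h in gens)]
    for g in minimal:
        if g != c:  # keep a nonempty conclusion
            rules.append((frozenset(g), frozenset(c - g)))

for premise, conclusion in sorted(rules, key=str):
    print(sorted(premise), "->", sorted(conclusion))  # exact rules (confidence = 1)
```

On this context the base reduces to three exact rules, b → a, c → a and {b, c} → a, each with confidence 1 and no redundancy among them.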
g) GrGrowth [11]: this algorithm was developed in order to mine the frequent generators and the positive border. It uses the compact FP-tree data structure and adopts the pattern-growth approach: it constructs a conditional database for each frequent generator.
The algorithm uses a depth-first search strategy to explore the search space, which is much more efficient than the breadth-first search strategy adopted by most of the existing generator mining algorithms.
GrGrowth prunes towards the positive border during the mining process to save mining cost. In fact, generator-based representations rely on a negative border to make the representation lossless; however, the number of itemsets on a negative border sometimes exceeds the total number of frequent itemsets, and the positive border is usually smaller. In addition, a set of frequent generators plus its positive border is always no larger than the corresponding complete set of frequent itemsets.

3) Conclusion
This approach has two advantages. Firstly, it reduces the run time and storage space compared with the first approach, because the number of frequent closed itemsets, and their processing time, are smaller than those of the extraction of all the frequent itemsets. Secondly, it extracts a reduced number of association rules without loss of information.

IV. COMPARATIVE STUDY

In this section, we have identified four characteristics in order to classify the algorithms already presented. The first characteristic is the computational complexity. The second is the kind of itemsets extracted (FI or FCI). The third is the strategy of association rule extraction from the initial context, knowing that there are three types:
- Test-and-build: browse the database level by level, generating a set of candidates associated with each level, and apply some metrics to reduce the result set (a step called pruning).
- Divide-and-build: divide the database into subsets and apply the itemset extraction process to each subset, in order to reduce the number of candidates. As in the first category, a pruning step is performed.
- Hybrid: process the database in depth but without division. In addition, to reduce the resulting set of closed itemsets, a statistical metric and heuristics are used.
The last characteristic, called "Data structures", includes two sub-classes: the structure used to represent the initial context, such as Tidsets, FP-tree and IT-tree, and the one used for storing the resulting itemsets, such as hash-table, hash-tree, Trie and sparse matrix. (See Table I in the last page.)

V. CONCLUSION

In this paper, we presented a state of the art of the best-known association rule extraction algorithms. We observe that the algorithms based on frequent itemset generation can easily be used for huge databases; however, they present some limits, such as redundancy and less useful associative rules. The second kind of algorithms, based on the FCA approach, can reduce the number of associative rules, with the advantage of their relevance; but the memory cost and the runtime increase, especially when the formal context is sparse.

REFERENCES

[1] R. Agrawal, T. Imielinski and A. N. Swami, "Mining association rules between sets of items in large databases", In Proceedings of the International Conference on Management of Data, ACM SIGMOD'93, Washington, D.C., USA, pp. 207-216, May 1993.
[2] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules", In J. B. Bocca, M. Jarke and C. Zaniolo, editors, Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, pp. 478-499, June 1994.
[3] Y. Bastide, N. Pasquier, R. Taouil, L. Lakhal and G. Stumme, "Mining minimal non-redundant association rules using frequent closed itemsets", Proceedings of the Intl. Conference DOOD'2000, LNCS, Springer-Verlag, July 2000, pp. 972-986.
[4] S. Ben Yahia, C. Latiri, G.W. Mineau and A. Jaoua, "Découverte des règles associatives non redondantes - application aux corpus textuels", In M.S. Hacid, Y. Kodrattof and D. Boulanger, editors, EGC, volume 17 of Revue des Sciences et Technologies de l'Information - série RIA ECA, pages 131-144, Hermes Sciences Publications, 2003.
[5] S. Brin, R. Motwani, J.D. Ullman and S. Tsur, "Dynamic itemset counting and implication rules for market basket data", In Proceedings of the ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, USA, J. Peckham, editor, pp. 255-264, ACM Press, 1997.
[6] W. Cheung, W. Heung and O. Zaiane, "Incremental Mining of Frequent Patterns Without Candidate Generation or Support Constraint", Proceedings of the Seventh International Database Engineering and Applications Symposium (IDEAS 2003), Hong Kong, China, July 2003.
[7] B. Ganter and R. Wille, "Formal Concept Analysis: Mathematical Foundations", Springer, 1999.
[8] J. Han, J. Pei and Y. Yin, "Mining frequent patterns without candidate generation", ACM-SIGMOD Int. Conf. on Management of Data, pp. 1-12, May 2000.
[9] T. Hamrouni, S. Ben Yahia and Y. Slimani, "Prince: An algorithm for generating rule bases without closure computations", In 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK'05), pages 346-355, Copenhagen, Denmark, 2005, Springer-Verlag, LNCS.
[10] R. L. Kruse and A. J. Ryba, "Data structures and program design in C++", Prentice Hall, 1999.
[11] G. Liu, J. Li and L. Wong, "A new concise representation of frequent itemsets using generators and a positive border", Knowledge and Information Systems, 17(1): 35-56, 2008.
[12] N. Pasquier, Y. Bastide, R. Taouil and L. Lakhal, "Efficient Mining of Association Rules Using Closed Itemset Lattices", Information Systems Journal, vol. 24, no. 1, 1999, pp. 25-46.
[13] J. Pei, J. Han, R. Mao, S. Nishio, S. Tang and D. Yang, "CLOSET: An efficient algorithm for mining frequent closed itemsets", Proceedings of the ACM SIGMOD DMKD'00, Dallas, TX, pp. 21-30.
[14] A. Savasere, E. Omiecinsky and S. Navathe, "An efficient algorithm for mining association rules in large databases", 21st Int'l Conf. on Very Large Databases (VLDB), September 1995.
[15] G. Stumme, R. Taouil, Y. Bastide, N. Pasquier and L. Lakhal, "Computing Iceberg Concept Lattices with TITANIC", J. on Knowledge and Data Engineering (KDE), vol. 2, no. 42, 2002, pp. 189-222.
[16] M. Zaki, S. Parthasarathy, M. Ogihara and W. Li, "New algorithms for fast discovery of association rules", In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, D. Heckerman, H. Mannila, D. Pregibon, R. Uthurusamy and M. Park, editors, pp. 283-296, AAAI Press, 1997.
[17] M. Zaki and C. J. Hsiao, "CHARM: An Efficient Algorithm for Closed Itemset Mining", Proceedings of the 2nd SIAM International Conference on Data Mining, Arlington, April 2002, pp. 34-43.

TABLE I. COMPARATIVE STUDY OF ASSOCIATION RULE EXTRACTION ALGORITHMS

Algorithm    Complexity      Extracts  Strategy              Database structure  Itemset storage
Apriori      O(mn2^m)        FI        Test-and-Generate     -                   hash-tree
AprioriTid   O(mn2^m)        FI        Test-and-Generate     Tidsets             hash-tree
Partition    O(2^m P)*       FI        Divide-and-Generate   -                   hash-tree
Dic          O(2^m M)**      FI        Divide-and-Generate   -                   Trie
Eclat        O(n^2 m^2)      FI        Divide-and-Generate   Tidsets             sparse matrix
FP-Growth    O(mn + m^3)     FI        Divide-and-Generate   FP-tree             hash table
Close        O(2^m n^2 m)    FCI       Test-and-Generate     -                   Trie
A-Close      O(2^m n^2 m)    FCI       Test-and-Generate     -                   Trie
Titanic      O(2^m n^2 m)    FCI       Test-and-Generate     -                   Trie
Closet       O(mn + m^4)     FCI       Divide-and-Generate   FP-tree             Trie
Charm        O(2^m + n^2)    FCI       Hybrid                IT-tree             Trie
Prince       O(2^m)          FCI       Test-and-Generate     -                   Trie
GrGrowth     O(mn + m^4)     FCI       Hybrid                FP-tree             hash table

* P: number of partitions. ** M: number of transactions.
