1 INTRODUCTION
ASSOCIATION rule mining, introduced in [1], is considered one of the most important tasks in Knowledge Discovery in Databases [2]. It aims at discovering, among the sets of items in transaction databases, implicative tendencies that can be valuable information for the decision-maker.
An association rule is defined as an implication X → Y, described by two interestingness measures, support and confidence, where X and Y are sets of items and X ∩ Y = ∅. Apriori [1] is the first algorithm proposed in the association rule mining field, and many other algorithms were derived from it. Starting from a database, it extracts all association rules satisfying minimum thresholds of support and confidence. It is well known that mining
algorithms can discover a prohibitive amount of association
rules; for instance, thousands of rules are extracted from a
database of several dozens of attributes and several
hundreds of transactions. Furthermore, as suggested by Silberschatz and Tuzhilin [3], valuable information is often represented by rare (low-support) and unexpected association rules that are surprising to the user. Thus, the more we increase the support threshold, the more efficient the algorithms are and the more obvious the discovered rules are; hence, the less interesting they are for the user.
As a result, it is necessary to bring the support threshold
low enough in order to extract valuable information.
Unfortunately, the lower the support is, the larger the
volume of rules becomes, making it intractable for a
decision-maker to analyze the mining result. Experiments
show that rules become almost impossible to use when the number of rules exceeds 100. Thus, it is crucial to help
the decision-maker with an efficient technique for reducing
the number of rules.
To overcome this drawback, several methods were
proposed in the literature. On the one hand, different
algorithms were introduced to reduce the number of
itemsets by generating closed [4], maximal [5] or optimal
itemsets [6], and several algorithms to reduce the number of
rules, using nonredundant rules [7], [8], or pruning
techniques [9]. On the other hand, postprocessing methods
can improve the selection of discovered rules. Different
complementary postprocessing methods may be used, like
pruning, summarizing, grouping, or visualization [10].
Pruning consists of removing uninteresting or redundant rules. Summarizing generates concise sets of rules, grouping produces groups of rules, and visualization improves the readability of a large number of rules by using adapted graphical representations.
However, most of the existing postprocessing methods
are generally based on statistical information in the
database. Since rule interestingness strongly depends on
user knowledge and goals, these methods do not guarantee
that interesting rules will be extracted. For instance, if the
user looks for unexpected rules, all the already known rules
should be pruned. Or, if the user wants to focus on specific
schemas of rules, only this subset of rules should be
selected. Moreover, as suggested in [11], rule postprocessing methods must rely on strong interactivity with the user.
The representation of user knowledge is an important
issue. The more the knowledge is represented in a flexible,
expressive, and accurate formalism, the more efficient the rule selection is. In the Semantic Web field (http://www.w3.org/2001/sw/), ontologies are considered the most appropriate representation to express the complexity of user knowledge, and several specification languages have been proposed.
This paper proposes a new interactive postprocessing
approach, ARIPSO (Association Rule Interactive post-Proces-
sing using Schemas and Ontologies) to prune and filter
discovered rules. First, we propose to use Domain Ontologies
in order to strengthen the integration of user knowledge in
the postprocessing task. Second, we introduce Rule Schema
formalism by extending the specification language proposed
by Liu et al. [12] for user beliefs and expectations toward the
use of ontology concepts. Furthermore, an interactive and
iterative framework is designed to assist the user throughout
the analyzing task. The interactivity of our approach relies on
a set of rule mining operators defined over the Rule Schemas in
order to describe the actions that the user can perform.
This paper is structured as follows: Section 2 introduces
notations and definitions used throughout the paper.
Section 3 justifies our motivations for using ontologies.
Section 4 describes the research domain and reviews related
works. Section 5 presents the proposed framework and its
elements. Section 6 is devoted to the results obtained by
applying our method over a questionnaire database.
Finally, Section 7 presents conclusions and shows directions
for future research.
2 NOTATIONS AND DEFINITIONS
The association rule mining task can be stated as follows: let I = {i_1, i_2, ..., i_n} be a set of literals, called items. Let D = {t_1, t_2, ..., t_m} be a set of transactions over I. A nonempty subset of I is called an itemset and is defined as X = {i_1, i_2, ..., i_k}; in short, the itemset X can also be denoted as X = i_1 i_2 ... i_k. The number of items in an itemset is called the length of the itemset, and an itemset of length k is referred to as a k-itemset. Each transaction t_i contains an itemset i_1 i_2 ... i_k, the number k of items varying from one transaction to another.
Definition 1. Let X ⊆ I and T ⊆ D. We define the set of all transactions that contain the itemset X as

t : 2^I → 2^D, t(X) = {t ∈ D | X ⊆ t}.

Similarly, we describe the itemsets contained in all the transactions of T by

i : 2^D → 2^I, i(T) = {x ∈ I | ∀t ∈ T, x ∈ t}.
Definition 2. An association rule is an implication X → Y, where X and Y are two itemsets and X ∩ Y = ∅. The former, X, is called the antecedent of the rule, and the latter, Y, is called the consequent.

A rule X → Y is described using two important statistical factors:

. The support of the rule, defined as supp(X → Y) = supp(X ∪ Y) = |t(X ∪ Y)|/|D|, is the ratio of the number of transactions containing X ∪ Y to the total number of transactions. If supp(X → Y) = s, then s% of the transactions contain the itemset X ∪ Y.
. The confidence of the rule, defined as conf(X → Y) = supp(X → Y)/supp(X) = supp(X ∪ Y)/supp(X) = c, is the ratio (c%) of the transactions that, containing X, also contain Y.
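To make the two measures concrete, the following Python sketch computes t(X), support, and confidence over a small hypothetical transaction database; the item names, the data, and the helper names are illustrative assumptions and do not come from the paper.

```python
# Minimal sketch of support/confidence computation over a toy database.
D = [
    {"grape", "beef", "milk"},
    {"apple", "beef"},
    {"grape", "pear", "milk"},
    {"grape", "milk"},
]

def t(X, D):
    """Indices of the transactions that contain itemset X."""
    return {idx for idx, trans in enumerate(D) if X <= trans}

def supp(X, D):
    """Relative support of itemset X."""
    return len(t(X, D)) / len(D)

def conf(X, Y, D):
    """Confidence of the rule X -> Y: supp(X ∪ Y) / supp(X)."""
    return supp(X | Y, D) / supp(X, D)

X, Y = {"grape"}, {"milk"}
print(supp(X | Y, D))   # support of the rule grape -> milk
print(conf(X, Y, D))    # confidence of the rule grape -> milk
```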
Starting from a database and two thresholds, minsupp and minconf, for the minimal support and the minimal confidence, respectively, the problem of finding association rules, as discussed in [1], is to generate all rules whose support and confidence are greater than the given thresholds. This problem can be divided into two main subproblems:

. first, all frequent itemsets are extracted. An itemset X is called a frequent itemset in the transaction database D if supp(X) ≥ minsupp;
. then, for each frequent itemset X, the set of rules X \ Y → Y, with Y ⊂ X, satisfying conf(X \ Y → Y) ≥ minconf, is generated.

If X is frequent and no superset of X is frequent, X is denoted as a maximal itemset.
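As an illustration of the second subproblem, the sketch below enumerates, for a single frequent itemset X, the rules X \ Y → Y with nonempty Y ⊂ X and keeps those reaching a confidence threshold; it assumes the supp() helper and the toy database D sketched above, and the minconf value is arbitrary.

```python
# Sketch of rule generation from one frequent itemset
# (assumes the supp() helper and D defined in the previous sketch).
from itertools import combinations

def rules_from_itemset(X, D, minconf=0.8):
    rules = []
    items = sorted(X)
    # Enumerate every nonempty proper subset Y of X as a candidate consequent.
    for r in range(1, len(items)):
        for Y in combinations(items, r):
            Y = set(Y)
            antecedent = X - Y
            c = supp(X, D) / supp(antecedent, D)
            if c >= minconf:
                rules.append((antecedent, Y, c))
    return rules

print(rules_from_itemset({"grape", "milk"}, D))
```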
Theorem 1. Let X ⊆ I and T ⊆ D. Let c_it(X) denote the composition of the two mappings, c_it(X) = i(t(X)), and let c_ti(T) = t(i(T)). Then, c_it and c_ti are both Galois closure operators [13] on itemsets and on sets of transactions, respectively.

Definition 3. A closed itemset [14] is defined as an itemset X which is equal to its closure, i.e., X = c_it(X). The minimal closed itemset containing an itemset Y is obtained by applying the closure operator c_it to Y.
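A minimal sketch of the closure operator c_it, reusing the hypothetical database D and the t() helper from the earlier sketches; the i() helper below is an illustrative implementation of the dual mapping of Definition 1.

```python
# Sketch of the Galois closure c_it(X) = i(t(X)) (assumes D and t() above).
def i(T, D):
    """Items common to all transactions whose indices are in T
    (for simplicity, the convention i(∅) = I is omitted here)."""
    return set.intersection(*(D[idx] for idx in T)) if T else set()

def closure(X, D):
    """c_it(X): the smallest closed itemset containing X."""
    return i(t(X, D), D)

X = {"grape"}
print(closure(X, D))          # closed itemset containing {"grape"}
print(closure(X, D) == X)     # True iff X is a closed itemset
```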
Definition 4. Let R_1 and R_2 be two association rules. We say that rule R_1 is more general than rule R_2, denoted R_1 ≤ R_2, if R_2 can be generated by adding items to the antecedent or to the consequent of R_1. In this case, we say that a rule R_j is redundant [15] if there exists some rule R_i such that R_i ≤ R_j.

Consequently, in a collection of rules, the nonredundant rules are the most general ones, i.e., those rules having minimal antecedents and consequents in terms of the subset relation.
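The generality relation of Definition 4 reduces to subset tests on antecedents and consequents; the small sketch below makes this explicit (the rule encoding as pairs of frozensets is an illustrative choice, not the paper's).

```python
# Sketch of the "more general" relation between rules (Definition 4).
# A rule is represented as a pair (antecedent, consequent) of frozensets.
def more_general(r1, r2):
    """True if r1 is more general than r2 (r2 only adds items to r1)."""
    a1, c1 = r1
    a2, c2 = r2
    return a1 <= a2 and c1 <= c2 and (a1, c1) != (a2, c2)

r1 = (frozenset({"grape"}), frozenset({"milk"}))
r2 = (frozenset({"grape", "pear"}), frozenset({"milk"}))
print(more_general(r1, r2))   # True: r2 is redundant with respect to r1
```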
Definition 5. A rule set is optimal [6] with respect to an
interestingness metric if it contains all the rules except those with
no greater interestingness than one of its more general rules. An
optimal rule set is a subset of a nonredundant rule set.
Definition 6. Formally, an ontology is a quintuple O = ⟨C, R, I, H, A⟩ [16]. C = {C_1, C_2, ..., C_n} is a set of concepts, and R = {R_1, R_2, ..., R_m} is a set of relations defined over concepts. I is a set of instances of concepts, and H is a Directed Acyclic Graph (DAG) defined by the subsumption relation (the is-a relation, ⊑) between concepts: we say that C_2 is-a C_1, denoted C_2 ⊑ C_1, if the concept C_1 subsumes the concept C_2. A is a set of axioms bringing additional constraints on the ontology.
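As a rough illustration of Definition 6, the sketch below encodes a tiny ontology as plain Python data (concept set, is-a edges, instances) together with a subsumption test; this is a simplification for illustration, not the OWL-DL representation actually used later in the paper.

```python
# Minimal illustrative encoding of an ontology O = <C, R, I, H, A>.
from dataclasses import dataclass, field

@dataclass
class Ontology:
    concepts: set                                     # C
    relations: dict = field(default_factory=dict)     # R: relation name -> pairs of concepts
    instances: dict = field(default_factory=dict)     # I: concept -> set of instances
    is_a: dict = field(default_factory=dict)          # H: concept -> set of direct parents

    def subsumes(self, general, specific):
        """True if `general` subsumes `specific` through the is-a DAG."""
        if general == specific:
            return True
        return any(self.subsumes(general, parent)
                   for parent in self.is_a.get(specific, set()))

onto = Ontology(
    concepts={"FoodItem", "Fruits", "grape", "apple"},
    is_a={"Fruits": {"FoodItem"}, "grape": {"Fruits"}, "apple": {"Fruits"}},
)
print(onto.subsumes("FoodItem", "grape"))   # True
```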
3 MOTIVATIONS FOR THE GENERAL IMPRESSION IMPROVEMENT USING ONTOLOGIES
Since the early 2000s, in the Semantic Web context, the number of available ontologies has been increasing, covering a wide range of application domains. This is a great advantage for ontology-based user knowledge representation.
This paper contributes at several levels to reducing the number of association rules. One of our most important contributions relies on using ontologies as the representation of user background knowledge. Thus, we extend the specification language proposed by Liu et al. [17] (General Impressions (GI), Reasonably Precise Concepts (RPC), and Precise Knowledge (PK)) by the use of ontology concepts.
Example. Let us consider the case of General Impressions.
The user might believe that there exist some associations
among milk OR cheese, Fruit items, and beef (assume that
the user uses the taxonomy in Fig. 1). He/she could
specify his/her beliefs using General Impressions:
gi(<milk; cheese+; Fruit; beef>):
The following association rules are examples of rules that conform to this specification:

apple → beef
grape, pear, beef → milk.
But when working directly with database items, developing such specifications becomes a complex task. Moreover, it would be very useful for the user to be able to introduce additional interesting information into the GI language. For example, in the market case, it would be very useful to find out whether the customers buying diet products also buy ecological products. In order to select this type of rule, the user should be able to create an RPC such as

rpc(<DietProducts → EcologicalProducts>),

where DietProducts and EcologicalProducts represent, respectively, the set of products integrated in diets and the products which are produced in an ecological way.
Defining such concepts is not possible using taxonomies. Starting from the taxonomy presented in Fig. 1, we developed an ontology based on the earlier considerations. We propose to integrate two data properties of Boolean type in order to define the products that are useful in diets (isDiet) and those that are ecological (isEcological). Description logic [18], used in designing ontologies, allows concept definition using restrictions on properties. Therefore, the concept DietProducts is defined as a restriction on the FoodItem hierarchy using the data property isDiet, describing the items useful in a diet. Similarly, we define the EcologicalProducts concept.

In our example, the apple and chicken items are diet products, and the milk, grape, and beef items are ecological products. Fig. 2 presents the structure of the ontology resulting after applying a reasoner; the ontology construction is detailed in Section 5.3.
4 RELATED WORK
4.1 Concise Representations of Frequent Itemsets
Interestingness measures are metrics used to capture dependencies and implications between database items; they express the strength of a pattern association.
Since frequent itemset generation is considered an expensive operation, mining frequent closed itemsets (a preliminary idea presented in [4]) was proposed in order to reduce the number of frequent itemsets. An itemset X is a closed frequent itemset if there exists no itemset X' ⊃ X such that t(X) = t(X'). Thus, the number of frequent closed itemsets generated is reduced in comparison with the number of frequent itemsets.
The CLOSET algorithm was proposed in [19] as a new
efficient method for mining closed itemsets. CLOSET uses a
novel frequent pattern tree (FP-tree) structure, which is a
compressed representation of all the transactions in the
database. Moreover, it uses a recursive divide-and-conquer
and database projection approach to mine long patterns.
Another solution for the reduction of the number of
frequent itemsets is mining maximal frequent itemsets [5]. The
authors proposed the MAFIA algorithm, based on depth-first traversal and several pruning methods such as Parent Equivalence Pruning (PEP), FHUT, HUTMFI, and Dynamic Reordering. However, the main drawback of the methods extracting maximal frequent itemsets is the loss of information: subset frequencies are not available, and thus generating rules is not possible.
4.2 Redundancy Reduction of Association Rules
Generating all association rules that satisfy the confidence threshold is, in turn, a combinatorial problem.
Zaki and Hsiao proposed the CHARM algorithm [20] in order to generate all frequent closed itemsets. They used an itemset-tidset search tree and pursued the aim of generating a small nonredundant rule set [7]. To this end, the authors first found minimal generators for the closed itemsets, and then they generated nonredundant association rules using two closed itemsets.
Pasquier et al. [8] proposed the Close algorithm in order to extract association rules. The Close algorithm is based on a new mining method, the pruning of the closed itemset lattice, in order to extract frequent closed itemsets. Association rules are then generated from the frequent itemsets derived from the frequent closed itemsets. Nevertheless, Zaki and Hsiao [20] showed that their CHARM algorithm outperforms the CLOSET, Close, and MAFIA algorithms.
Fig. 1. Supermarket item taxonomy [12].

Fig. 2. Visualization of the ontology created based on the supermarket item taxonomy.
More recently, Li [6] proposed optimal rule sets, defined
with respect to an interestingness metric. An optimal rule
set contains all rules except those with no greater interest-
ingness than one of its more general rules.
A set of reduction techniques for redundant rules was
proposed and implemented in [21]. The developed
techniques are based on the generalization/specialization of the antecedent/consequent of the rules, and they are divided into methods for multiantecedent rules and methods for multiconsequent rules.
Hahsler et al. [22] were interested in the idea of
generating association rules from arbitrary sets of itemsets.
This makes it possible for a user to propose a set of itemsets and to integrate another set generated by a data mining
tool. In order to generate rules, a support counter is needed;
consequently, the authors proposed an adequate data
structure which provides fast access: prefix trees.
Toivonen et al. proposed in [9] a novel technique for
redundancy reduction based on rule covers. The notion of
rule cover is defined as the subset of a rule set describing
the same database transaction set as the rule set. Thus, the
authors developed an algorithm to efficiently extract a rule
cover out of a set of given rules.
The notion of subsumed rules, discussed in [23], describes
a set of rules having the same consequent and several
additional conditions in the antecedent regarding a certain
rule. Bayardo, Jr., et al. [24] proposed a new pruning
measure (Minimum Improvement) described as the difference
between the confidences of two rules in a specialization/generalization relationship. The specific rule is pruned if the proposed measure is less than a prespecified threshold, since in that case it does not bring more information compared to the general rule.
Nevertheless, both closed and maximal itemset mining
still break down at low support thresholds. To address these
limitations, Omiecinski proposed in [25] three new impor-
tant interestingness measures: any-confidence, all-confidence,
and bond. All these measures are indicators of the degree of
relatedness between the items in an association. The most
interesting one, all-confidence, introduced as an alternative to
support, represents the minimum confidence of all associa-
tion rules extracted from an itemset. Bond is also similar to
support, but with respect to a subset of the data rather than
the entire database.
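For reference, all-confidence can be computed directly as the support of the itemset divided by the largest support of its single items; the short sketch below illustrates this, reusing the supp() helper and the toy database D from the earlier sketches (both are assumptions, not code from the paper).

```python
# Sketch of the all-confidence measure of an itemset (assumes supp() and D above).
def all_confidence(X, D):
    """Minimum confidence over all rules that can be built from itemset X."""
    return supp(X, D) / max(supp({item}, D) for item in X)

print(all_confidence({"grape", "milk"}, D))
```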
4.3 User-Driven Association Rule Mining
Interestingness measures were proposed in order to dis-
cover only those association rules that are interesting
according to these measures. They have been divided into
objective measures and subjective measures. Objective mea-
sures depend only on data structure. Many survey papers
summarize and compare the objective measure definitions
and properties [26], [27]. Unfortunately, being restricted to
data evaluation, the objective measures are not sufficient to
reduce the number of extracted rules and to capture the
interesting ones. Several approaches integrating user
knowledge have been proposed.
In addition, subjective measures were proposed to
integrate explicitly the decision-maker knowledge and to
offer a better selection of interesting association rules.
Silberschatz and Tuzhilin [3] classified subjective measures into unexpectedness (a pattern is interesting if it is surprising to the user) and actionability (a pattern is interesting if it can help the user take some action).
As early as 1994, in the KEFIR system [28], the key
finding and deviation notions were suggested. Grouped in
findings, deviations represent the difference between the
actual and the expected values. KEFIR defines interesting-
ness of a key finding in terms of the estimated benefits, and
potential savings of taking corrective actions that restore the
deviation back to its expected value. These corrective
actions are specified in advance by the domain expert for
various classes of deviations.
Later, Klemettinen et al. [29] proposed templates to
describe the form of interesting rules (inclusive templates)
and not interesting rules (restrictive templates). The idea of
using templates for association rule extraction was reused
in [30]. Other approaches proposed to use a rule-like
formalism to express user expectations [3], [12], [31], and
the discovered association rules are pruned/summarized
by comparing them to user expectations.
Imielinski et al. [32] proposed a query language for
association rule pruning based on SQL, called M-SQL. It
allows imposing constraints on the condition and/or the
consequent of the association rules. In the same domain of
query-based association rule pruning, but more constraint-driven, Ng et al. [33] proposed an architecture for exploratory
mining of rules. The authors suggested a set of solutions for
several problems: the lack of user exploration and control,
the rigid notion of relationship, and the lack of focus. In
order to overcome these problems, Ng et al. proposed a new
query language called Constrained Association Query and
they pointed out the importance of user feedback and user
flexibility in choosing interestingness metrics.
Another related approach was proposed by An et al. in
[34] where the authors introduced domain knowledge in order
to prune and summarize discovered rules. Their first algorithm uses a data taxonomy, defined by the user, to describe the semantic distance between rules and to group them. Their second algorithm groups the discovered rules that share at least one item in the antecedent and the consequent.
In 2007, a new methodology was proposed in [35] to
prune and organize rules with the same consequent. The
authors suggested transforming the database in an associa-
tion rule base in order to extract second-level association
rules. Called metarules, the extracted rules r1 → r2 express relations between two association rules and help in pruning/grouping the discovered rules.
4.4 Ontologies in Data Mining
In knowledge engineering and Semantic Web fields,
ontologies have interested researchers since their first proposal, in philosophy, by Aristotle. Ontologies have evolved over the years from controlled vocabularies to thesauri (glossaries), and later, to taxonomies [36].
In the early 1990s, an ontology was defined by Gruber as
a formal, explicit specification of a shared conceptualization [37].
By conceptualization, we understand here an abstract model
of some phenomenon described by its important concepts.
The formal notion denotes the idea that machines should be
able to interpret an ontology. Moreover, explicit refers to the
transparent definition of ontology elements. Finally, shared indicates that an ontology brings together knowledge common to a certain group, and not individual knowledge.
Several other definitions are proposed in the literature.
For instance, in [38], an ontology is viewed as a logical theory
accounting for the intended meaning of a formal vocabulary, and
later, in 2001, Maedche and Staab proposed a more
artificial-intelligence-oriented definition. Thus, ontologies
are described as (meta)data schemas, providing a controlled
vocabulary of concepts, each with an explicitly defined and
machine processable semantics [16].
Depending on the granularity, four types of ontologies
are proposed in the literature: upper (or top level)
ontologies, domain ontologies, task ontologies, and applica-
tion ontologies [38]. Top-level ontologies deal with general
concepts; while the other three types deal with domain-
specific concepts.
Ontologies, introduced into data mining in the early 2000s, can be used in several ways [39]: Domain and
Background Knowledge Ontologies, Ontologies for Data
Mining Process, or Metadata Ontologies. Background Knowl-
edge Ontologies organize domain knowledge and play
important roles at several levels of the knowledge discovery
process. Ontologies for Data Mining Process codify the mining process description and the choice of the most appropriate task for the given problem, while Metadata Ontologies describe the construction process of items.
In this paper, we focus on Domain and Background
Knowledge Ontologies. The first idea of using Domain
Ontologies was introduced by Srikant and Agrawal with
the concept of Generalized Association Rules (GAR) [40]. The
authors proposed to use taxonomies over the mined data (an is-a hierarchy) in order to generalize/specialize rules.
In [41], it is suggested that an ontology of background
knowledge can benefit all the phases of a KDD cycle
described in CRISP-DM. The role of ontologies depends on the given mining task and method, and on the data characteristics. From business understanding to deployment, the authors delivered a complete example of using ontologies in a cardiovascular risk domain.
Related to Generalized Association Rules, the notion of
raising was presented in [42]. Raising is the operation of generalizing rules (making them more abstract) in order to increase support while keeping confidence high enough. This allows strong rules to be discovered, and it also provides sufficient support for rules that, before raising, would not have reached minimum support due to the particular items they referred to. The difference with Generalized Association Rules is that this solution proposes to use a specific level for raising and mining.
Another contribution, very close to [40], [41], uses
taxonomies to generalize and prune association rules. The
authors developed an algorithm, called GART [43], which, given several taxonomies over attributes, iteratively uses each taxonomy to generalize rules and then prunes redundant rules at each step.
A very recent approach, [44], uses ontologies in a
preprocessing step. Several domain-specific and user-defined
constraints are introduced and grouped into two types:
pruning constraints, meant to filter uninteresting items, and
abstraction constraints permitting the generalization of items
toward ontology concepts. The data set is first preprocessed
according to the constraints extracted from the ontology,
and then, the data mining step takes place. The differences with our approach are the following. First, they apply constraints in the preprocessing task, whereas we work in the postprocessing task. The advantage of pruning constraints is that they exclude from the start the information that the user is not interested in, which makes it possible to apply the Apriori algorithm to this reduced database. However, if the user is not sure about which items he/she should prune, he/she has to create several pruning tests and, for each test, rerun the Apriori algorithm, whose execution time is very high. Second, they use SeRQL to express user knowledge, whereas we propose a more expressive and flexible language for representing user expectations, i.e., Rule Schemas.
The item-relatedness filter was proposed by Natarajan and
Shekar [45]. Starting from the idea that the discovered rules
are generally obvious, they introduced the idea of relatedness between items, which measures their similarity according to item taxonomies. This measure computes the relatedness of all pairs of rule items. The relatedness can be computed among the items of the condition and/or of the consequent, or between the condition and the consequent of the rule.
While Natarajan and Shekar measure the item-relatedness
of an association rule, Garcia et al. developed in [46] and
extended in [47] a novel technique called Knowledge
Cohesion (KC). The proposed metric is composed of two
new ones: Semantic Distance (SD) and Relevance Assessment
(RA). SD measures how close two items are semantically, using the ontology, each type of relation being weighted differently. The numerical value RA
expresses the interest of the user for certain pairs of items
in order to encourage the selection of rules containing those
pairs. In this paper, the ontology is used only for the SD
computation, differing from our approach which uses
ontologies for Rule Schemas definition. Moreover, the
authors propose a metric-based approach for itemset
selection, while we propose a pruning/filtering schemas-
based method of association rules.
5 DESCRIPTION OF THE ARIPSO FRAMEWORK
The proposed approach is composed of two main parts (as
shown in Fig. 3). First, the knowledge base allows formalizing user knowledge and goals: domain knowledge offers a general view of user knowledge over the database domain, and user expectations express the prior user knowledge about the discovered rules. Second, the postprocessing task consists of iteratively applying a set of filters over the extracted rules in order to select interesting ones: the minimum improvement constraint filter, the item-relatedness filter, and rule schema filtering/pruning.
The novelty of this approach resides in supervising the knowledge discovery process with two different conceptual structures for user knowledge representation (one or several ontologies, and several rule schemas generalizing General Impressions) and in proposing an iterative process.

Fig. 3. Framework description.
5.1 Interactive Postmining Process
The ARIPSO framework proposes to the user an interactive process of rule discovery, presented in Fig. 4. Taking his/her feedback into account, the user is able to revise his/her expectations according to the intermediate results. The framework suggests the following steps to the user:
1. ontology construction: starting from the database, and possibly from existing ontologies, the user develops an ontology over the database items;
2. definition of Rule Schemas (as GIs and RPCs): the user expresses his/her local goals and expectations concerning the association rules that he/she wants to find;
3. choice of the operators to be applied over the rule schemas created, and then application of the operators;
4. visualization of the results: the filtered association rules are proposed to the user;
5. selection/validation: starting from these preliminary results, the user can validate the results or revise his/her information;
6. application of two filters already existing in the literature and detailed in Section 5.5: these two filters can be applied over the rules whenever the user needs them, with the main goal of reducing the number of rules; and
7. the interactive loop, which permits the user to revise the information that he/she proposed: he/she can return to step 2 to modify the rule schemas, or to step 3 to change the operators. Moreover, in the interactive loop, the user can decide to apply one of the two predefined filters discussed in step 6.
5.2 Improving General Impressions with Ontologies
One existing approach interests us in particular: the specification language proposed by Liu et al. [17]. The authors proposed to represent user expectations in terms of the discovered rules using three levels of specification: General Impressions and Reasonably Precise Concepts, representing the user's vague feelings, and, finally, his/her Precise Knowledge. The authors developed a representation formalism which is very close to the association rule formalism, flexible enough, and comprehensible for the user. For General Impressions, the authors proposed the following syntax:
gi(<S_1, ..., S_m>) [support, confidence],

where S_i is an element of an item taxonomy or an expression over items built with the operators of the language, and the support and confidence thresholds are optional.
In the GI formalism, the user knows that a set of items is associated, but he/she does not know the direction of the implication, i.e., which items he/she would put in the antecedent and which ones in the consequent. This is the main difference between GIs and RPCs: RPCs are able to describe a complete implication. PKs use the same formalism as RPCs, adding mandatory constraints on support and confidence.
Moreover, the authors proposed to filter four types of rules, using conforming rules and unexpected rules with respect to the antecedent and/or the consequent:

. Conforming rules: association rules that conform to the specified beliefs;
. Unexpected antecedent rules: association rules that are unexpected regarding the antecedent of the specified beliefs;
. Unexpected consequent rules: association rules that are unexpected regarding the consequent of the specified beliefs; and
. Both-side unexpected rules: association rules that are unexpected regarding both the antecedent and the consequent of the specified beliefs.
To improve association rule selection, we propose a new
rule filtering model, called Rule Schemas (RS). A rule schema
describes, in a rule-like formalism, the user expectations in
terms of interesting/obvious rules. As a result, Rule
Schemas act as a rule grouping, defining rule families.
The Rule Schema formalism is based on the specification
language for user knowledge introduced by Liu et al. [12].
The model proposed by Liu et al. is described using
elements from an item taxonomy allowing an is-a organiza-
tion of database attributes. Using item taxonomies has
many advantages: the representation of user expectations is
more general, and thus, filtered rules are more interesting
for the user.
However, a taxonomy of items might not be enough. The user might want to use concepts that are more expressive and accurate than generalized concepts and that result from relationships other than the is-a relation (e.g., isEcological, isCookedWith). This is why we consider the use of ontologies more appropriate. An ontology includes the features of taxonomies but adds more representation power. In a taxonomy, the means for subject description consist essentially of one relationship: the subsumption relationship used to build the hierarchy. The set of items is open, but the language used to describe them is closed [48], since it uses a single relationship (subsumption). Thus, a taxonomy is simply a hierarchical categorization or classification of items in a domain. On the contrary, an ontology is a specification of several characteristics of a domain, defined using an open vocabulary.
In addition, it is difficult for a domain expert to know
exactly the support and confidence thresholds for each rule
schema proposed, because of their statistical definition.
That is why we consider that using Precise Knowledge in
user expectation representation might be useless. Thus, we
propose to improve only two of the three representations
introduced in [12]: General Impressions and Reasonably Precise Concepts.

Fig. 4. Interactive process description.
Therefore, Rule Schemas bring the expressiveness of ontologies into the postprocessing of association rules, combining not only item constraints but also ontology concept constraints.
Definition 7. A Rule Schema expresses the fact that the user expects certain elements to be associated in the extracted association rules. This can be expressed as

RS(<X_1, ..., X_n → Y_1, ..., Y_m>),

where X_i, Y_j ∈ C of O = ⟨C, R, I, H, A⟩ and the implication is optional. In other words, the proposed formalism combines General Impressions and Reasonably Precise Concepts. If we use the formalism with an implication, an implicative Rule Schema is defined, extending the RPC. On the other hand, if we do not keep the implication, we define nonimplicative Rule Schemas, generalizing the GI.
Example. Let us consider the taxonomy in Fig. 1 and an ontology based on this taxonomy containing the description of the concepts BioProducts and EcologicalProducts, as shown in Fig. 2. We can then define a nonimplicative Rule Schema,

RS1(<BioProducts, EcologicalProducts>),

and an implicative Rule Schema,

RS2(<BioProducts → EcologicalProducts>).
5.3 Ontology Description
Domain knowledge, defined as the user information
concerning the database, is described in our framework
using ontologies.
Compared to the taxonomies used in the specification language proposed in [12], ontologies offer a more complex knowledge representation model by extending the single is-a relation present in a taxonomy with the set R of relations.
In addition, the axioms bring important improvements
permitting concept definition starting from existing infor-
mation in the ontology.
In this scenario, it is fundamental to connect the ontology concepts C of O = ⟨C, R, I, H, A⟩ to the database, each one of them being connected to one or several items of I. To this end, we consider three types of concepts: leaf concepts, generalized concepts derived from the subsumption relation (⊑) in H of O, and restriction concepts, which only ontologies provide. In order to define each type of concept, recall that the set of items of a database is defined as I = {i_1, i_2, ..., i_n}.
The leaf concepts (C^0) are defined as

C^0 = {c^0 ∈ C | ∄c' ∈ C, c' ⊑ c^0}.

They are connected to the database in the simplest way: each concept of C^0 is associated to one item of the database,

f^0 : C^0 → I, ∀c^0 ∈ C^0, ∃i ∈ I, i = f^0(c^0).
Generalized concepts (C^1) are described as the concepts that subsume other concepts in the ontology. A generalized concept is connected to the database through its subsumed concepts; this means that, recursively, only the leaf concepts subsumed by the generalized concept contribute to its database connection:

f : C^1 → 2^I, ∀c^1 ∈ C^1, f(c^1) = {f^0(c^0) | c^0 ∈ C^0, c^0 ⊑ c^1}.
Restriction concepts are described using logical expressions defined over items and are organized in the C^2 subset. In a first attempt, we base the description of these concepts on restrictions over properties, as available in description logics. Thus, a restriction concept can be connected to a disjunction of items.
Example. In order to explain restriction concepts, let us consider the database presented in Fig. 5, described by three transactions. Moreover, let us consider the ontology presented in the same figure as the ontology constructed over the items of the database and described as follows. The concepts of the ontology are

{FoodItems, Fruits, DairyProducts, Meat, DietProducts, EcologicalProducts, ...},

and the three types of concepts are:

Leaf concepts: grape, pear, apple, milk, cheese, butter, beef, chicken, pork;
Generalized concepts: Fruits, DairyProducts, Meat, FoodItems;
Restriction concepts: DietProducts, EcologicalProducts.
Two data properties are also integrated in order to define whether a product is useful for a diet or is ecological. For example, the DietProducts restriction concept is described in description logic as

DietProducts = FoodItems ⊓ isDiet:TRUE,

defining all food items whose Boolean property isDiet is TRUE. For our example, isDiet is instantiated as follows:

isDiet: {(apple, TRUE), (chicken, TRUE)}.

Fig. 5. Ontology description.
Now, we are able to connect the ontology and the database. As already presented, leaf concepts are connected to items in a very simple way; for example, the concept grape is connected to the item of the same name, f^0(grape) = grape. On the contrary, the generalized concept Fruits is connected through its three subsumed concepts:

f(Fruits) = {grape, pear, apple}.

Similarly, we can describe the connection for the other concepts. More interestingly, the restriction concept DietProducts is connected through the concepts satisfying the restrictions in its definition. Thus, DietProducts is connected through the concepts apple and chicken:

f(DietProducts) = {apple, chicken}.
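A small sketch of how the concept-to-item connection f could be computed for the three concept types of this example; the data structures (leaf mapping, is-a edges, property values) are illustrative choices, not the paper's implementation.

```python
# Illustrative computation of f(C) for leaf, generalized, and restriction concepts.
leaf_to_item = {c: c for c in
                ["grape", "pear", "apple", "milk", "cheese",
                 "butter", "beef", "chicken", "pork"]}            # f^0
is_a = {"grape": "Fruits", "pear": "Fruits", "apple": "Fruits",
        "milk": "DairyProducts", "cheese": "DairyProducts", "butter": "DairyProducts",
        "beef": "Meat", "chicken": "Meat", "pork": "Meat",
        "Fruits": "FoodItems", "DairyProducts": "FoodItems", "Meat": "FoodItems"}
is_diet = {"apple": True, "chicken": True}                         # data property isDiet

def ancestors(c):
    """All concepts subsuming c (following the is-a chain)."""
    out = set()
    while c in is_a:
        c = is_a[c]
        out.add(c)
    return out

def f_generalized(concept):
    """Items connected to a generalized concept via its subsumed leaf concepts."""
    return {item for leaf, item in leaf_to_item.items() if concept in ancestors(leaf)}

def f_diet_products():
    """Items connected to the restriction concept DietProducts (isDiet = TRUE)."""
    return {item for leaf, item in leaf_to_item.items() if is_diet.get(leaf, False)}

print(f_generalized("Fruits"))   # {'grape', 'pear', 'apple'}
print(f_diet_products())         # {'apple', 'chicken'}
```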
5.4 Operations over Rule Schemas
The rule schema filter is based on operators applied over rule schemas, allowing the user to perform several actions over the discovered rules. We propose two main operators: the pruning operator and the filtering operator. The filtering operator comprises three different operators: conforming, unexpectedness, and exception. We reuse the conforming and unexpectedness operators proposed by Liu et al., and we bring two new operators into the postprocessing task: pruning and exceptions.

These four operators are presented in this section. To this end, let us consider an implicative rule schema RS_1: (<X → Y>), a nonimplicative rule schema RS_2: (<U, V>), and an association rule AR_1: A → B, where X, Y, U, and V are ontology concepts, and A and B are itemsets.
Definition 8. Let us consider an ontology concept C connected in the database to f(C) = {y_1, ..., y_n} and an itemset X = {x_1, ..., x_k}. We say that the itemset X is conforming to the concept C if ∃y_i ∈ f(C) such that y_i ∈ X.
Pruning. The pruning operator allows the user to remove families of rules that he/she considers uninteresting. In most databases there exist relations between items that we consider obvious or that we already know, and it is not useful to find these relations among the discovered associations. The pruning operator applied over a rule schema, P(RS), eliminates all association rules matching the rule schema. To extract all the rules matching a rule schema, the conforming operator is used.

Conforming. The conforming operator applied over a rule schema, C(RS), confirms an implication or finds the implication between several concepts. As a result, the rules matching all the elements of a nonimplicative rule schema are filtered. For an implicative rule schema, the condition and the conclusion of the association rule should match those of the schema.
Example. The rule AR_1 is selected by the operator C(RS_1) if the condition and the conclusion of AR_1 are conforming to the condition and, respectively, the conclusion of RS_1. Translated into the ontological definition of concepts, AR_1 is conforming to RS_1 if the itemset A is conforming to the concept X and the itemset B is conforming to the concept Y.

Similarly, the rule AR_1 is filtered by C(RS_2) if the condition and/or the conclusion of AR_1 are conforming to the schema RS_2. In other words, if the itemset A ∪ B is conforming to the concept U and the itemset A ∪ B is conforming to the concept V, then the rule AR_1 is conforming to the nonimplicative rule schema RS_2.
Unexpectedness. Of higher interest for the user, the unexpectedness operator U(RS) filters a set of rules with a surprise effect for the user. This type of rule interests the user more than the conforming ones since, generally, a decision-maker seeks to discover new knowledge with regard to his/her prior knowledge.

Moreover, several types of unexpected rules can be filtered according to the rule schema: rules unexpected regarding the antecedent, U_p; rules unexpected regarding the consequent, U_c; and rules unexpected regarding both sides, U_b.

For instance, let us consider that the operator U_p(RS_1) extracts the rule AR_1, which is unexpected according to the condition of the rule schema RS_1. This is possible if the rule consequent B is conforming to the concept Y, while the condition itemset A is not conforming to the concept X.

In a similar way, we define the two other unexpectedness operators.
Exceptions. Finally, the exception operator is defined only over implicative rule schemas (e.g., RS_1) and extracts conforming rules with respect to the following new implicative rule schema: X ∧ Z → ¬Y, where Z is a set of items.
Example. Let us consider the implicative rule schema RS: Fruits → EcologicalProducts, where

f(Fruits) = {grape, apple, pear},
f(EcologicalProducts) = {grape, milk},

and I = {grape, apple, pear, milk, beef} (see Fig. 1 for the supermarket taxonomy). Also, let us consider that the following set of association rules is extracted by traditional techniques:

R_1: grape, beef → milk, pear
R_2: apple → beef
R_3: apple, pear, milk → grape
R_4: grape, pear → apple
R_5: beef → grape
R_6: milk, beef → grape

Thus, the operator C(RS) filters the rules R_1 and R_3, the operator U_p(RS) filters the rules R_5 and R_6, and the operator U_c(RS) filters the rules R_2 and R_4. The pruning operator P(RS) prunes the rules selected by the conforming operator C(RS). Let us explain the operator U_c(RS): it filters the rules whose conclusion itemset is not conforming to the conclusion concept of the RS (EcologicalProducts) and whose condition itemset is conforming to the condition concept of the RS (Fruits). The rule R_4 is filtered by U_c(RS) because the itemset {apple} does not contain an item corresponding to the EcologicalProducts concept, apple ∉ f(EcologicalProducts), and because the itemset {grape, pear} contains at least one item corresponding to the Fruits concept, pear ∈ f(Fruits).
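To make the operator semantics concrete, here is a hypothetical Python sketch of the conforming and unexpectedness filters for an implicative rule schema, applied to the toy rules above; the helper names and rule encoding are my own illustrative choices, not the paper's implementation.

```python
# Sketch of the C, U_p, and U_c operators for an implicative rule schema X -> Y.
f = {"Fruits": {"grape", "apple", "pear"},
     "EcologicalProducts": {"grape", "milk"}}          # concept -> connected items

rules = {                                               # rule id -> (antecedent, consequent)
    "R1": ({"grape", "beef"}, {"milk", "pear"}),
    "R2": ({"apple"}, {"beef"}),
    "R3": ({"apple", "pear", "milk"}, {"grape"}),
    "R4": ({"grape", "pear"}, {"apple"}),
    "R5": ({"beef"}, {"grape"}),
    "R6": ({"milk", "beef"}, {"grape"}),
}

def conforming(itemset, concept):
    """Definition 8: the itemset contains at least one item connected to the concept."""
    return bool(itemset & f[concept])

def C(schema):       # conforming rules
    X, Y = schema
    return [r for r, (a, b) in rules.items()
            if conforming(a, X) and conforming(b, Y)]

def U_p(schema):     # unexpected antecedent
    X, Y = schema
    return [r for r, (a, b) in rules.items()
            if not conforming(a, X) and conforming(b, Y)]

def U_c(schema):     # unexpected consequent
    X, Y = schema
    return [r for r, (a, b) in rules.items()
            if conforming(a, X) and not conforming(b, Y)]

RS = ("Fruits", "EcologicalProducts")
print(C(RS))     # ['R1', 'R3']
print(U_p(RS))   # ['R5', 'R6']
print(U_c(RS))   # ['R2', 'R4']
```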
5.5 Filters
In order to reduce the number of rules, three filters are integrated into the framework: the operators applied over rule schemas, the minimum improvement constraint filter [24], and the item-relatedness filter [45].

The minimum improvement constraint filter (MICF) [24] selects only those rules whose confidence is greater, by at least minimp, than the confidence of any of their simplifications.
Example. Let us consider the following three association rules:

grape, pear → milk (confidence = 85%)
grape → milk (confidence = 90%)
pear → milk (confidence = 83%)

The last two rules are simplifications of the first one. According to Bayardo et al., the first rule is interesting only if its confidence improves on the confidence of all its simplifications. In our case, the first rule does not improve on the 90 percent confidence of the best of its simplifications (the second rule), so it is not considered an interesting rule and it is not selected.
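A minimal sketch of the minimum improvement check for one rule against the simplifications obtained by dropping one antecedent item; it assumes the conf() helper and toy database D sketched earlier, and the minimp value and rule encoding are illustrative.

```python
# Sketch of the minimum improvement constraint filter (MICF) for a single rule.
def passes_micf(antecedent, consequent, D, minimp=0.01):
    """Keep the rule only if its confidence improves by at least minimp on the
    confidence of each simplification obtained by dropping one antecedent item
    (assumes the conf() helper and database D from the earlier sketches)."""
    c = conf(antecedent, consequent, D)
    for item in antecedent:
        simpler = antecedent - {item}
        if simpler and c - conf(simpler, consequent, D) < minimp:
            return False
    return True

print(passes_micf({"grape", "pear"}, {"milk"}, D))   # False with the toy database
```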
The item-relatedness filter (IRF) was proposed by Shekar and Natarajan [45]. Starting from the idea that the discovered rules are generally obvious, they introduced the notion of relatedness between items, measuring their semantic distance in item taxonomies. This measure computes the relatedness of all pairs of rule items; it can be computed among the items of the condition and/or of the consequent, or between the condition and the consequent of the rule.

In our approach, we use the last type of item-relatedness, because users are interested in finding associations between itemsets with different functionalities, coming from different domains. This measure is computed as the minimum distance between the condition items and the consequent items, as presented hereafter. The distance between each pair of items taken from the condition and the consequent, respectively, is computed as the minimum path that connects the two items in the ontology, denoted d(a, b). Thus, the item-relatedness (IR) of a rule is defined as the minimum of all the distances computed between the items of the condition and those of the consequent:

AR_1: A → B,
IR(AR_1) = MIN(d(a_i, b_j)), ∀a_i ∈ A and b_j ∈ B.
Example. Let us consider the ontology in Fig. 2. For the association rule R_1 below, the item-relatedness is computed as follows:

R_1: grape, pear, butter → milk,
IR(R_1) = min(d(grape, milk), d(pear, milk), d(butter, milk)) = min(4, 4, 2) = 2.
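The sketch below illustrates one way to compute d(a, b) as the shortest path through the lowest common ancestor of an is-a hierarchy and then IR as the minimum over condition/consequent item pairs; the tiny hierarchy encoded here is an assumption for illustration that reproduces the distances of the example above.

```python
# Sketch of the item-relatedness (IR) of a rule over a small is-a hierarchy.
parent = {"grape": "Fruits", "pear": "Fruits", "apple": "Fruits",
          "milk": "DairyProducts", "butter": "DairyProducts",
          "Fruits": "FoodItems", "DairyProducts": "FoodItems"}

def path_to_root(x):
    path = [x]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def d(a, b):
    """Length of the shortest path between a and b through their lowest common ancestor."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(n for n in pa if n in pb)
    return pa.index(common) + pb.index(common)

def item_relatedness(antecedent, consequent):
    return min(d(a, b) for a in antecedent for b in consequent)

print(item_relatedness({"grape", "pear", "butter"}, {"milk"}))   # 2 in this toy hierarchy
```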
6 EXPERIMENTAL STUDY
This study is based on a questionnaire database provided by Nantes Habitat (http://www.nantes-habitat.fr/), dealing with customer satisfaction concerning accommodation. The database consists of an annual survey (conducted since 2003) performed by Nantes Habitat on a sample of 1,500 out of a total of 50,000 customers. The questionnaire consists of 67 different questions with four possible answers expressing the degree of satisfaction: very satisfied, quite satisfied, rather not satisfied, and dissatisfied, coded as 1, 2, 3, and 4.
Table 1 introduces a sample of questions with their meaning. For instance, the item q1 = 1 describes a customer who is very satisfied with the transport in his/her district (q1 = Is your district transport practical?).
In order to target the most interesting rules, we fixed a minimum support of 2 percent, a maximum support of 30 percent, and a minimum confidence of 80 percent for the association rule mining process. Among the available algorithms, we used the Apriori algorithm to extract association rules, and 358,072 rules were discovered.
For example, the following association rule describes the relationship between questions q2, q3, q47, and question q70. If the customers are very satisfied with the access to the city center (q2), the shopping facilities (q3), and the apartment ventilation (q47), then they are also satisfied with the documents received from the Nantes Habitat Agency (q70), with a confidence of 85.9 percent:

R_1: q2 = 1 ∧ q3 = 1 ∧ q47 = 1 → q70 = 1,
support = 15.2%, confidence = 85.9%.
6.1 Ontology Structure and Ontology-Database Mapping
In the first step of the interactive process described in Section 5.1, the user develops an ontology over the database items. In our case, starting from the database attributes, the ontology was created by the Nantes Habitat expert. During several sessions, we discussed the database attributes with the expert and asked her to classify them.
TABLE 1
Examples of Questions and Meaning
Moreover, we gathered additional interesting information by asking her to develop her expectations and knowledge connected to the database attributes. In this section, we present the development of the ontology for our case study.
6.1.1 Conceptual Structure of the Ontology
To describe the ontology, we use the Semantic Web representation language OWL-DL [49]. Based on description logics, the OWL-DL language permits, in addition to the ontological structure, the creation of restriction concepts using necessary and sufficient conditions over other concepts. We used the Protégé editor [50] to edit and validate the ontology, and the Jambalaya [51] environment for ontology graph exploration.
During several exchanges with the Nantes Habitat expert, she developed an ontology composed of two main parts, a sample of which is presented in Fig. 6. The ontology has seven depth levels and a total of 130 concepts, among which 113 are primitive concepts and 17 are restriction concepts. Concerning siblings, the concepts have a mean of six child concepts, with a maximum of 13 child concepts. Moreover, two data properties are introduced.
The first part of the ontology is an organization of the database items, with the root defined by the Attribute concept, grouping 113 subsumed concepts. The items are organized according to the question topics of the Nantes Habitat questionnaire. For instance, the District concept regroups 14 questions (from q1 to q14) concerning the facilities and the quality of life in a district.
The second hierarchy, Topics, regroups all 17 restriction concepts created by the expert using necessary and sufficient conditions over primitive concepts. Moreover, the subsumption relation (⊑) is completed by the relation hasAnswer, associating the Attribute concepts to an integer in {1, 2, 3, 4}, simulating the attribute-value relation of the database.
For instance, let us consider the restriction concept SatisfactionDistrict. In natural language, it expresses the satisfied answers of clients to the questions concerning the district. In other words, an item is an instance of the SatisfactionDistrict concept if it represents a question between q1 and q14, subsumed by the District concept, with a satisfied answer (1 or 2). The SatisfactionDistrict restriction concept is described in description logic as

SatisfactionDistrict = District ⊓ (hasAnswer:1 OR hasAnswer:2).
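As a rough illustration of how such a restriction concept selects items, the following sketch evaluates a SatisfactionDistrict-like definition over hypothetical (question, answer) items; the question range, the sample items, and the data layout are assumptions made for the example, not the paper's code.

```python
# Illustrative evaluation of a restriction concept such as SatisfactionDistrict.
# Items are encoded as (question, answer) pairs, e.g., ("q3", 1) stands for q3 = 1.
items = [("q3", 1), ("q3", 4), ("q8", 2), ("q20", 1), ("q47", 3)]

district_questions = {f"q{i}" for i in range(1, 15)}   # q1..q14, subsumed by District

def satisfaction_district(item):
    """District question answered with a satisfied answer (1 or 2)."""
    question, answer = item
    return question in district_questions and answer in (1, 2)

print([item for item in items if satisfaction_district(item)])
# [('q3', 1), ('q8', 2)]
```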
6.1.2 Ontology-Database Mapping
As a part of rule schemas, ontology concepts are mapped to database items. Several connections between the ontology and the database can be designed; due to implementation requirements, the ontology and the database are mapped through instances.
The ontology-database connection is made manually by the expert. In our case, with 67 attributes and four values, the expert did not meet any problems in realizing the connection, but we agree that for large databases a manual connection could be very time-consuming. That is why integrating an automatic ontology construction plug-in into our tool is one of our main perspectives.
Thus, using the simplest ontology-database mapping, the expert directly connected one instance of the ontology to an item (semantically, the nearest one). For example, the expert connected the instance Q11_1 to the item (q11 = 1): f^0(Q11_1) = (q11 = 1).

Then, the leaf concepts (C^0) of the Attribute hierarchy were connected by the expert to a set of items (semantically, the nearest ones). The concept Q11 of the ontology is associated with the attribute q11 = Are you satisfied with the transport in your district? Furthermore, the concept Q11 has two instances describing the question q11 with two possible answers, 1 and 3. Let us consider that the concept Q11 was connected by the expert to two items as follows: f(Q11) = {f^0(Q11_1), f^0(Q11_3)} = {q11 = 1, q11 = 3}. The connection of generalized concepts follows the same idea.
A second type of connection consists in connecting the concepts of the Topics hierarchy to the database. Let us consider the restriction concept DissatisfactionCalmDistrict (Fig. 7). In natural language, it is defined as all the concepts subsumed by CalmDistrict (connected to questions q8, q9, q10, and q11) with a dissatisfied answer. The DissatisfactionCalmDistrict restriction concept is described by the expert in description logic as

DissatisfactionCalmDistrict = CalmDistrict ⊓ (hasAnswer:3 OR hasAnswer:4).
Considering that the user has instantiated the concept Q8 with the answer 3, and the concept Q11 with the answers 1 and 3, the concept DissatisfactionCalmDistrict is connected to the database as follows:

f(DissatisfactionCalmDistrict) = {f^0(Q8_3), f^0(Q11_3)} = {q8 = 3, q11 = 3}.

Fig. 6. Ontology structure visualized with the Jambalaya Protégé plug-in.

Fig. 7. Restriction concept construction using necessary and sufficient conditions in Protégé.
6.2 Results
Example 1. This first example presents the efficiency of our new approach concerning the reduction of the number of rules. To this end, we asked the expert to test the four filters: on the one hand, the pruning filters (MICF, IRF, and pruning rule schemas), and on the other hand, the selection filters (rule schema filters); the meanings of the acronyms are given in Table 5. The expert could use each filter separately and in several combinations in order to compare the results and validate them.

Hence, the expert proposed a set of pruning rule schemas (Table 2) and a set of filtering rule schemas (Table 3). She constructed these rule schemas during several meetings devoted to testing the new tool and analyzing the generated results.
At the beginning, the expert is faced with the whole set of 358,072 extracted association rules. In a first attempt, we focus on the pruning filters. If the MICF is applied, all the specialized rules not improving confidence are pruned. In Table 4, we can see that the MICF prunes 92.3 percent of the rules, making it a very efficient filter for redundancy pruning. In addition, the IRF prunes 71 percent of the rules (rules involving semantically close items). The third pruning filter, Pruning Rule Schemas, prunes 43 percent of the rules.
We compare the three pruning filters and their combinations in Table 4. The first column is the reference for our experiments. The rates of rules remaining after each of the three filters is used separately are presented in columns 2, 3, and 4. We can note that the MICF filter is the most discriminating, pruning 92.3 percent of the rules, compared to the other two, which prune 71 percent and 43 percent of the rules, respectively. We can also note that combining the first two filters, MICF and IRF, gives a more powerful pruning than combining the first one with the third one. Finally, applying the three filters together over the set of association rules yields a rule reduction of 96.3 percent.
However, even with the most reductive combination, number 8 (Table 4), the expert would still have to analyze 13,382 rules, which is impossible manually. Thus, other filters should be applied. The expert was interested in the dissatisfaction phenomena, represented by answers 3 and 4 in the questionnaire, and chose to apply all the rule schemas with their corresponding operators (Table 3) for each combination of the first three filters presented in Table 4. Table 6 presents the number of rules filtered by each rule schema.
In Table 6, the first column, Nb, identifies each filter combination as denoted in Table 4. We can
note that the rule schema filters are very efficient. More-
over, studying the dissatisfaction of the clients improves the
filtering power of the rule schemas.
Let us consider the second rule schema. Applied over the initial set of 358,072 association rules with the conforming operator, it filters 1,008 rules, representing 0.28 percent of the complete set. But it is obviously very difficult for an expert to analyze a set of rules on the order of a thousand. Hence the importance of the pruning filters: with them, the set of rules extracted in each case contains fewer than 500 rules. We can also note that the IRF filter is more powerful than the other pruning filters, and that combining two filters at the same time gives remarkable results:
. on the fifth line, combining MICF with IRF reduces the number of rules to 77;
. combining IRF with pruning rule schemas reduces the set to three rules; and
. the last two rows yield the same results, which we explain by the fact that we are working on an incomplete set of rules because of the maximum support threshold imposed in the mining process.
It is very important to note that the quality of the selected
rules was certified by the Nantes Habitat expert.
TABLE 2
Pruning Rule Schemas
TABLE 3
Filtering Rule Schemas
TABLE 4
Pruning Rate for Each Filter Combination
TABLE 5
Notation Meaning
Example 2. This second example is proposed in order to outline the quality of the filtered rules and to confirm the importance of interactivity in our framework. To this end, we present the sequence of steps (Fig. 8) performed by the expert during the interactive process, steps already described in Section 6.1. The first step of the interactive process, ontology construction, was already presented in Section 5.1.
As in the first example, the expert is faced with the whole set of rules. In a first attempt (steps 2 and 3), she proposed to investigate the quality of the rules filtered by two of the rule schemas, RS_2 and RS_3, with the conforming operator. The first one deals with dissatisfaction concerning the tranquility of the district, and the second one searches for rules associating dissatisfaction with the price with dissatisfaction concerning the common areas of the building.
Applying these two schemas to the whole rule set, an important selection is made:
. C(RS_2) filters 1,008 association rules; and
. C(RS_3) filters 96 association rules.
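Under our reading of the conforming operator C(RS), a rule conforms to a rule schema when every item of its antecedent (respectively, consequent) belongs to the set of items obtained by connecting the schema's antecedent (respectively, consequent) concepts to the database. The sketch below illustrates this check; the item sets are hypothetical and do not reproduce the actual definitions of RS_2 and RS_3.

def conforms(rule_ant, rule_cons, schema_ant_items, schema_cons_items):
    """True if the rule instantiates the schema's antecedent and consequent."""
    return rule_ant <= schema_ant_items and rule_cons <= schema_cons_items

# hypothetical schema: price dissatisfaction -> common-areas dissatisfaction
schema_ant_items = {"q97=4"}
schema_cons_items = {"q26=4", "q28=4"}

print(conforms({"q97=4"}, {"q28=4"}, schema_ant_items, schema_cons_items))  # True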
The expert is in the visualization and validation steps (4 and 5), and she analyzes the 96 rules filtered by C(RS_3), given their reduced number compared to the 1,008 rules filtered by C(RS_2). For example, let us consider the following set of association rules:
q17 = 4; q26 = 4; q97 = 4 ==> q28 = 4 (C = 92.8%, S = 2.6%)
q16 = 4; q17 = 4; q26 = 4; q97 = 4 ==> q28 = 4 (C = 92.5%, S = 2.5%)
q15 = 4; q17 = 4; q97 = 4 ==> q28 = 4 (C = 80.5%, S = 1.9%)
q15 = 4; q17 = 4; q97 = 4 ==> q26 = 4; q28 = 4 (C = 80.5%, S = 1.9%)
The expert noted that the second rule is a specialization of the first one (the item q16 = 4 is added to the antecedent) and that its confidence is lower than that of the more general rule. Thus, the second rule does not bring important information to the whole set of rules; hence, it can be pruned. In the same way, the expert noted that the fourth rule is a specialization of the third one and that its confidence is not improved either. The expert decided to modify her initial information (step 5) and to go back to the beginning of the process via the interactivity loop (step 7), choosing to apply the MICF (step 6), which retains 27,602 rules. The expert decided to keep these results (steps 4 and 5) and to return to the interactivity loop, going back to steps 2 and 3 in order to redefine the rule schemas and operators.
This time, the expert proposed to use only the rule schema C(RS_3), owing to the high volume of rules extracted by the other one. Using C(RS_3), 50 rules are filtered, and the presence of rules 1 and 3 and the absence of rules 2 and 4 (from the set presented above) validate the use of the MICF (steps 4 and 5). Moreover, the high reduction in the number of rules validates the application of C(RS_3). At this point, the expert returned to step 2 in order to modify the rule schema, proposing RS_4; first, she applied the unexpectedness-regarding-the-antecedent operator U_p(RS_4), and then she returned to step 3 in order to modify the operator, choosing the exception operator E(RS_4). These results are briefly presented in Table 6, but due to space limits, they are not detailed in this section.
The expert analyzed the 50 rules extracted by C(RS_3) and found several trivial implications, noting that the implications between certain items did not interest her. For instance, let us consider the following set of rules:
q17 = 4; q97 = 4 ==> q16 = 4 (C = 86.7%, S = 3.5%)
q25 = 4; q28 = 4; q97 = 4 ==> q26 = 4 (C = 100%, S = 2.0%)
These rules involve items from EntryHall and CloseSurrounding; thus, the expert proposed to apply rule schemas RS_5 to RS_8 with the pruning operator (steps 2 and 3) in order to prune these uninteresting rules. As a consequence, 15 rules are retained, and the absence of the above rules validates the application of the pruning rule schemas (steps 4 and 5).
Let us consider the following two rules:
q28 = 4; q97 = 4 ==> q17 = 4 (C = 81.1%, S = 2.9%)
q8 = 4; q16 = 4; q97 = 4 ==> q9 = 4 (C = 88.6%, S = 2.1%)
The expert noted that most of the 15 rules are implications between attributes subsumed by the same concept in the ontology. For instance, the attributes q28 and q17 of the first rule, described by the Q28 and the
Fig. 8. Description of the interactive process during the experiment.
TABLE 6
Rates for Rule Schema Filters Applied after
the Other Three Filter Combinations
Q17 concepts, are subsumed by the concept Stairwell. Similarly, for the second rule, q8 and q9 are subsumed by the CalmDistrict concept. Thus, the expert applied the IRF filter, and only three rules are filtered. One of these rules attracted the interest of the expert:
q15 = 4; q16 = 4; q97 = 4 ==> q9 = 4 (Support = 2.3%, Confidence = 79.1%),
which can be read as: if a client is not satisfied with the cleaning of the close surroundings and the entry hall, and if he is not satisfied with the service charges, then, with a confidence of 79.1 percent, he considers that his district has a bad reputation. This rule is very interesting because the expert thought that the state of the building did not influence the opinion concerning the district, whereas the rule shows that it does.
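To summarize the item-relatedness idea behind the IRF as it is used in this example, the following hedged sketch prunes a rule when one of its antecedent attributes and one of its consequent attributes are subsumed by the same concept of the ontology; the parent map is a toy excerpt built from the examples above, not the paper's code.

parent_concept = {
    "q17": "Stairwell", "q28": "Stairwell",
    "q8": "CalmDistrict", "q9": "CalmDistrict",
    "q15": "CloseSurrounding", "q16": "EntryHall", "q97": "ServiceCharges",
}

def related(ant_attrs, cons_attrs):
    """True if some antecedent and consequent attributes share a subsuming concept."""
    return any(
        parent_concept.get(a) is not None and parent_concept.get(a) == parent_concept.get(c)
        for a in ant_attrs for c in cons_attrs
    )

def irf_prune(rules):
    """Keep only the rules whose antecedent and consequent are not related."""
    return [(ant, cons) for ant, cons in rules if not related(ant, cons)]

rules = [
    ({"q28", "q97"}, {"q17"}),        # q28 and q17 share Stairwell -> pruned
    ({"q8", "q16", "q97"}, {"q9"}),   # q8 and q9 share CalmDistrict -> pruned
    ({"q15", "q16", "q97"}, {"q9"}),  # kept: the rule the expert retained
]
print(irf_prune(rules))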
7 CONCLUSION
This paper discusses the problem of selecting interesting association rules from among huge volumes of discovered rules. The major contributions of our paper are stated below. First, we propose to integrate user knowledge into association rule mining using two different types of formalism: ontologies and rule schemas. On the one hand, domain ontologies improve the integration of user domain knowledge concerning the database field into the postprocessing step. On the other hand, we propose a new formalism, called Rule Schemas, extending the specification language proposed by Liu et al. Rule schemas are especially used to express the user's expectations and goals concerning the discovered rules.
Second, a set of operators applicable over the rule schemas is proposed in order to guide the user throughout the postprocessing step. Thus, several types of actions, such as pruning and filtering, are available to the user. Finally, the interactivity of our ARIPSO framework, relying on this set of rule mining operators, assists the user throughout the analysis task and allows him/her to more easily select interesting rules by reiterating the filtering process.
By applying our new approach over a voluminous
questionnaire database, we allowed the integration of
domain expert knowledge in the postprocessing step in
order to reduce the number of rules to several dozen or fewer. Moreover, the quality of the filtered rules was
validated by the expert throughout the interactive process.
ACKNOWLEDGMENTS
The authors would like to thank Nantes Habitat, the Public
Housing Unit in Nantes, France, and more especially Ms. Christelle Le Bouter, and also Mr. Loic Glimois, for supporting this work.
REFERENCES
[1] R. Agrawal, T. Imielinski, and A. Swami, Mining Association
Rules between Sets of Items in Large Databases, Proc. ACM
SIGMOD, pp. 207-216, 1993.
[2] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy,
Advances in Knowledge Discovery and Data Mining. AAAI/MIT
Press, 1996.
[3] A. Silberschatz and A. Tuzhilin, What Makes Patterns Interesting
in Knowledge Discovery Systems, IEEE Trans. Knowledge and
Data Eng., vol. 8, no. 6, pp. 970-974, Dec. 1996.
[4] M.J. Zaki and M. Ogihara, Theoretical Foundations of Associa-
tion Rules, Proc. Workshop Research Issues in Data Mining and
Knowledge Discovery (DMKD 98), pp. 1-8, June 1998.
[5] D. Burdick, M. Calimlim, J. Flannick, J. Gehrke, and T. Yiu,
Mafia: A Maximal Frequent Itemset Algorithm, IEEE Trans.
Knowledge and Data Eng., vol. 17, no. 11, pp. 1490-1504, Nov. 2005.
[6] J. Li, On Optimal Rule Discovery, IEEE Trans. Knowledge and
Data Eng., vol. 18, no. 4, pp. 460-471, Apr. 2006.
[7] M.J. Zaki, Generating Non-Redundant Association Rules, Proc.
Intl Conf. Knowledge Discovery and Data Mining, pp. 34-43, 2000.
[8] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, Efficient Mining
of Association Rules Using Closed Itemset Lattices, Information
Systems, vol. 24, pp. 25-46, 1999.
[9] H. Toivonen, M. Klemettinen, P. Ronkainen, K. Hatonen, and H.
Mannila, Pruning and Grouping of Discovered Association
Rules, Proc. ECML-95 Workshop Statistics, Machine Learning, and
Knowledge Discovery in Databases, pp. 47-52, 1995.
[10] B. Baesens, S. Viaene, and J. Vanthienen, Post-Processing of
Association Rules, Proc. Workshop Post-Processing in Machine
Learning and Data Mining: Interpretation, Visualization, Integration,
and Related Topics with Sixth ACM SIGKDD, pp. 20-23, 2000.
[11] J. Blanchard, F. Guillet, and H. Briand, A User-Driven and
Quality-Oriented Visualization for Mining Association Rules,
Proc. Third IEEE Intl Conf. Data Mining, pp. 493-496, 2003.
[12] B. Liu, W. Hsu, K. Wang, and S. Chen, Visually Aided Exploration
of Interesting Association Rules, Proc. Pacific-Asia Conf. Knowledge
Discovery and Data Mining (PAKDD), pp. 380-389, 1999.
[13] G. Birkhoff, Lattice Theory, vol. 25. Am. Math. Soc., 1967.
[14] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal, Discovering
Frequent Closed Itemsets for Association Rules, Proc. Seventh Intl
Conf. Database Theory (ICDT 99), pp. 398-416, 1999.
[15] M. Zaki, Mining Non-Redundant Association Rules, Data
Mining and Knowledge Discovery, vol. 9, pp. 223-248, 2004.
[16] A. Maedche and S. Staab, Ontology Learning for the Semantic
Web, IEEE Intelligent Systems, vol. 16, no. 2, pp. 72-79, Mar. 2001.
[17] B. Liu, W. Hsu, L.-F. Mun, and H.-Y. Lee, Finding Interesting
Patterns Using User Expectations, IEEE Trans. Knowledge and
Data Eng., vol. 11, no. 6, pp. 817-832, Nov. 1999.
[18] I. Horrocks and P.F. Patel-Schneider, Reducing owl Entailment to
Description Logic Satisfiability, J. Web Semantics, pp. 17-29,
vol. 2870, 2003.
[19] J. Pei, J. Han, and R. Mao, Closet: An Efficient Algorithm for
Mining Frequent Closed Itemsets, Proc. ACM SIGMOD Workshop
Research Issues in Data Mining and Knowledge Discovery, pp. 21-30,
2000.
[20] M.J. Zaki and C.J. Hsiao, Charm: An Efficient Algorithm for
Closed Itemset Mining, Proc. Second SIAM Intl Conf. Data Mining,
pp. 34-43, 2002.
[21] M.Z. Ashrafi, D. Taniar, and K. Smith, Redundant Association
Rules Reduction Techniques, AI 2005: Advances in Artificial Intelligence, Proc. 18th Australian Joint Conf. Artificial Intelligence, pp. 254-263, 2005.
[22] M. Hahsler, C. Buchta, and K. Hornik, Selective Association Rule
Generation, Computational Statistics, vol. 23, no. 2, pp. 303-315,
Kluwer Academic Publishers, 2008.
[23] J. Bayardo, J. Roberto, and R. Agrawal, Mining the Most
Interesting Rules, Proc. ACM SIGKDD, pp. 145-154, 1999.
[24] R.J. Bayardo, Jr., R. Agrawal, and D. Gunopulos, Constraint-
Based Rule Mining in Large, Dense Databases, Proc. 15th Intl
Conf. Data Eng. (ICDE 99), pp. 188-197, 1999.
[25] E.R. Omiecinski, Alternative Interest Measures for Mining
Associations in Databases, IEEE Trans. Knowledge and Data Eng.,
vol. 15, no. 1, pp. 57-69, Jan./Feb. 2003.
[26] F. Guillet and H. Hamilton, Quality Measures in Data Mining.
Springer, 2007.
[27] P.-N. Tan, V. Kumar, and J. Srivastava, Selecting the Right
Objective Measure for Association Analysis, Information Systems,
vol. 29, pp. 293-313, 2004.
[28] G. Piatetsky-Shapiro and C.J. Matheus, The Interestingness of
Deviations, Proc. AAAI94 Workshop Knowledge Discovery in
Databases, pp. 25-36, 1994.
[29] M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I.
Verkamo, Finding Interesting Rules from Large Sets of Dis-
covered Association Rules, Proc. Intl Conf. Information and
Knowledge Management (CIKM), pp. 401-407, 1994.
[30] E. Baralis and G. Psaila, Designing Templates for Mining
Association Rules, J. Intelligent Information Systems, vol. 9, pp. 7-
32, 1997.
[31] B. Padmanabhan and A. Tuzhuilin, Unexpectedness as a
Measure of Interestingness in Knowledge Discovery, Proc.
Workshop Information Technology and Systems (WITS), pp. 81-90,
1997.
[32] T. Imielinski, A. Virmani, and A. Abdulghani, Datamine:
Application Programming Interface and Query Language for
Database Mining, Proc. Intl Conf. Knowledge Discovery and Data
Mining (KDD), pp. 256-262, http://www.aaai.org/Papers/KDD/
1996/KDD96-042.pdf, 1996.
[33] R.T. Ng, L.V.S. Lakshmanan, J. Han, and A. Pang, Exploratory
Mining and Pruning Optimizations of Constrained Associations
Rules, Proc. ACM SIGMOD Intl Conf. Management of Data, vol. 27,
pp. 13-24, 1998.
[34] A. An, S. Khan, and X. Huang, Objective and Subjective
Algorithms for Grouping Association Rules, Proc. Third IEEE
Intl Conf. Data Mining (ICDM 03), pp. 477-480, 2003.
[35] A. Berrado and G.C. Runger, Using Metarules to Organize and
Group Discovered Association Rules, Data Mining and Knowledge
Discovery, vol. 14, no. 3, pp. 409-431, 2007.
[36] M. Uschold and M. Grüninger, Ontologies: Principles, Methods,
and Applications, Knowledge Eng. Rev., vol. 11, pp. 93-155, 1996.
[37] T.R. Gruber, A Translation Approach to Portable Ontology
Specifications, Knowledge Acquisition, vol. 5, pp. 199-220, 1993.
[38] N. Guarino, Formal Ontology in Information Systems, Proc. First
Intl Conf. Formal Ontology in Information Systems, pp. 3-15, 1998.
[39] H. Nigro, S.G. Cisaro, and D. Xodo, Data Mining with Ontologies:
Implementations, Findings and Frameworks. Idea Group, Inc., 2007.
[40] R. Srikant and R. Agrawal, Mining Generalized Association
Rules, Proc. 21st Intl Conf. Very Large Databases, pp. 407-419,
http://citeseer.ist.psu.edu/srikant95mining.html, 1995.
[41] V. Svatek and M. Tomeckova, Roles of Medical Ontology in
Association Mining Crisp-dm Cycle, Proc. Workshop Knowledge
Discovery and Ontologies in ECML/PKDD, 2004.
[42] X. Zhou and J. Geller, Raising, to Enhance Rule Mining in Web
Marketing with the Use of an Ontology, Data Mining with
Ontologies: Implementations, Findings and Frameworks, pp. 18-36,
Idea Group Reference, 2007.
[43] M.A. Domingues and S.A. Rezende, Using Taxonomies to
Facilitate the Analysis of the Association Rules, Proc. Second Intl
Workshop Knowledge Discovery and Ontologies, held with ECML/
PKDD, pp. 59-66, 2005.
[44] A. Bellandi, B. Furletti, V. Grossi, and A. Romei, Ontology-
Driven Association Rule Extraction: A Case Study, Proc. Work-
shop Context and Ontologies: Representation and Reasoning, pp. 1-10,
2007.
[45] R. Natarajan and B. Shekar, A Relatedness-Based Data-Driven
Approach to Determination of Interestingness of Association
Rules, Proc. 2005 ACM Symp. Applied Computing (SAC), pp. 551-
552, 2005.
[46] A.C.B. Garcia and A.S. Vivacqua, Does Ontology Help Make
Sense of a Complex World or Does It Create a Biased Interpreta-
tion? Proc. Sensemaking Workshop in CHI 08 Conf. Human Factors
in Computing Systems, 2008.
[47] A.C.B. Garcia, I. Ferraz, and A.S. Vivacqua, From Data to
Knowledge Mining, Artificial Intelligence for Eng. Design, Analysis
and Manufacturing, vol. 23, pp. 427-441, 2009.
[48] L.M. Garshol, Metadata? Thesauri? Taxonomies? Topic Maps
Making Sense of It All, J. Information Science, vol. 30, no. 4,
pp. 378-391, 2004.
[49] I. Horrocks and P.F. Patel-Schneider, A Proposal for an owl Rules
Language, Proc. 13th Intl Conf. World Wide Web, pp. 723-731,
2004.
[50] W.E. Grosso, H. Eriksson, R.W. Fergerson, J.H. Gennari, S.W. Tu,
and M.A. Musen, Knowledge Modeling at the Millennium (the
Design and Evolution of Protege-2000), Proc. 12th Workshop
Knowledge Acquisition, Modeling and Management (KAW 99), 1999.
[51] M.-A. Storey, N.F. Noy, M. Musen, C. Best, R. Fergerson, and N.
Ernst, Jambalaya: An Interactive Environment for Exploring
Ontologies, Proc. Seventh Intl Conf. Intelligent User Interfaces
(IUI 02), pp. 239-239, 2002.
Claudia Marinica received the master's degree in KDD from the Polytechnique School of Nantes University in 2006, and the Computer Science degree from Politehnica University of Bucharest, Romania, in 2006. She is currently working toward the PhD degree in computer science in the Knowledge and Decision Team, LINA UMR CNRS 6241, at the Polytechnique School of Nantes University, France. Her main research interests are in Association Rule Mining and the Semantic Web.
Fabrice Guillet received the PhD degree in computer sciences from the Ecole Nationale Superieure des Telecommunications de Bretagne in 1995. He has been an associate professor (HdR) in computer science at PolytechNantes, and a member of the KnOwledge and Decision team (KOD) in the Nantes-Atlantic Laboratory of Computer Sciences (LINA UMR CNRS 6241), since 1997. He is a founder of the Knowledge Extraction and Management French-speaking association of research (EGC, www.egc.asso.fr). His research interests include knowledge quality and visualization in the frameworks of Data Mining and Knowledge Engineering. He has recently coedited two refereed books of chapters entitled Quality Measures in Data Mining (Springer, 2007) and Statistical Implicative Analysis: Theory and Applications (Springer, 2008).