Lecture Notes in Artificial Intelligence 3430
Active Mining
Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Volume Editors
Shusaku Tsumoto
Shimane University, School of Medicine
Department of Medical Informatics
89-1 Enya-cho, Izumo, Shimane 693-8501, Japan
E-mail: [email protected]
Takahira Yamaguchi
Keio University, Faculty of Science and Technology
Department of Administration Engineering
3-14-1 Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan
E-mail: [email protected]
Masayuki Numao
Osaka University, The Institute of Scientific and Industrial Research
Division of Intelligent Systems Science, Dept. of Architecture for Intelligence
8-1 Mihogaoka, Ibaraki, Osaka 567-0047, Japan
E-mail: [email protected]
Hiroshi Motoda
Osaka University, The Institute of Scientific and Industrial Research
Division of Intelligent Systems Science, Dept. of Advance Reasoning
8-1 Mihogaoka, Ibaraki, Osaka 567-0047, Japan
E-mail: [email protected]
ISSN 0302-9743
ISBN-10 3-540-26157-5 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-26157-5 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
Springer is a part of Springer Science+Business Media
springeronline.com
© Springer-Verlag Berlin Heidelberg 2005
Printed in Germany
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper SPIN: 11423270 06/3142 543210
Foreword
This volume contains the papers selected for presentation at the 2nd Interna-
tional Workshop on Active Mining (AM 2003) which was organized in conjunc-
tion with the 14th International Symposium on Methodologies for Intelligent
Systems (ISMIS 2003), held in Maebashi City, Japan, 28–31 October, 2003.
The workshop was organized by the Maebashi Institute of Technology in co-
operation with the Japanese Society for Artificial Intelligence. It was sponsored
by the Maebashi Institute of Technology, the Maebashi Convention Bureau, the
Maebashi City Government, the Gunma Prefecture Government, JSAI SIGKBS
(Japanese Artificial Intelligence Society, Special Interest Group on Knowledge-
Based Systems), a Grant-in-Aid for Scientific Research on Priority Areas (No.
759) “Implementation of Active Mining in the Era of Information Flood,” US
AFOSR/AOARD, the Web Intelligence Consortium (Japan), the Gunma Infor-
mation Service Industry Association, and Ryomo Systems Co., Ltd.
ISMIS is a conference series that was started in 1986 in Knoxville, Tennessee.
Since then it has been held in Charlotte (North Carolina), Knoxville (Tennessee),
Turin (Italy), Trondheim (Norway), Warsaw (Poland), Zakopane (Poland), and
Lyon (France).
The objective of this workshop was to gather researchers as well as practi-
tioners who are working on various research fields of active mining, share hard-
learned experiences, and shed light on the future development of active mining.
This workshop addressed many aspects of active mining, ranging from theories,
methodologies, and algorithms to their applications. We believe that it also pro-
duced a contemporary overview of modern solutions and it created a synergy
among different branches but with a similar goal — facilitating data collection,
processing, and knowledge discovery via active mining.
We express our appreciation to the sponsors of the symposium and to all
who submitted papers for presentation and publication in the proceedings. Our
sincere thanks especially go to the Organizing Committee of AM 2003: Shusaku
Tsumoto, Hiroshi Motoda, Masayuki Numao, and Takahira Yamaguchi. Also,
our thanks are due to Alfred Hofmann of Springer for his continuous support.
This volume contains the papers based on the tutorials of ISMIS 2003 and the
papers selected from the regular papers presented at the 2nd International Work-
shop on Active Mining (AM 2003), held as part of ISMIS 2003 in Maebashi,
Gunma, on October 28, 2003. (URL: http://www.med.shimane-u.ac.jp/
med info/am2003)
There were 38 paper submissions from Japan, US, Korea, China and Vietnam
for AM 2003. Papers went through a rigorous reviewing process. Each paper was
reviewed by at least three Program Committee members. When all the reviews
of a paper were in conflict, another PC member was asked to review this paper
again. Finally, 20 submissions were selected by the Program Committee members
for presentation.
The PC members who attended this workshop reviewed all the papers and
the presentations during the workshops and decided that 12 of them could be
accepted with minor revisions, and 8 of them could be accepted with major
revisions for the publication in the postproceedings. The authors were asked to
submit the revised versions of their papers and, again, the PC members reviewed
them for about four months. Through this process, we accepted 16 out of 20
papers. Thus, of 38 papers submitted, 16 were accepted for this postproceedings,
corresponding to an acceptance ratio of only 42.1%.
AM 2003 provided a forum for exchanging ideas among many researchers in
various areas of decision support, data mining, machine learning and information
retrieval and served as a stimulus for mutual understanding and cooperation.
The papers contributed to this volume reflect advances in active mining as
well as complementary research efforts in the following areas:
– Text mining
– Graph mining
– Success/failure stories in data mining and lessons learned
– Data mining for evidence-based medicine
– Distributed data mining
– Data mining for knowledge management
– Active learning
– Meta learning
– Active sampling
– Usability of mined pieces of knowledge
– User interface for data mining
We wish to express our gratitude to Profs. Zbigniew W. Raś and Ning Zhong,
who accepted our proposal on this workshop and helped us to publish the post-
proceedings. Without their sincere help, we could not have published this volume.
We wish to express our thanks to all the PC members, who reviewed the
papers at least twice, for the workshop and the postproceedings. Without their
contributions, we could not have selected high-quality papers with high
confidence.
We also want to thank all the authors who submitted valuable papers to
AM 2003 and all the workshop attendees.
This time, all the submissions and reviews were made through the CyberChair
system (URL: http://www.cyberchair.org/). We wish to thank the Cyber-
Chair system development team. Without this system, we could not have edited
this volume in such a speedy way.
We also extend our special thanks to Dr. Shoji Hirano, who launched the
CyberChair system for AM 2003 and contributed to editing this volume.
Finally, we wish to express our thanks to Alfred Hofmann at Springer for his
support and cooperation.
Organizing Committee
Hiroshi Motoda (Osaka University, Japan)
Masayuki Numao (Osaka University, Japan)
Takahira Yamaguchi (Keio University, Japan)
Shusaku Tsumoto (Shimane University, Japan)
Program Committee
Hiroki Arimura (Hokkaido University, Japan)
Stephen D. Bay (Stanford University, USA)
Wesley Chu (UCLA, USA)
Saso Dzeroski (Jozef Stefan Institute, Slovenia)
Shoji Hirano (Shimane University, Japan)
Tu Bao Ho (JAIST, Japan)
Robert H.P. Engels (CognIT, Norway)
Ryutaro Ichise (NII, Japan)
Akihiro Inokuchi (IBM Japan, Japan)
Hiroyuki Kawano (Nanzan University, Japan)
Yasuhiko Kitamura (Kwansei Gakuin University, Japan)
Marzena Kryszkiewicz (Warsaw University of Technology, Poland)
T.Y. Lin (San Jose State University, USA)
Bing Liu (University of Illinois at Chicago, USA)
Huan Liu (Arizona State University, USA)
Tsuyoshi Murata (NII, Japan)
Masayuki Numao (Osaka University, Japan)
Miho Ohsaki (Doshisha University, Japan)
Takashi Onoda (CRIEPI, Japan)
Luc de Raedt (University of Freiburg, Germany)
Zbigniew Raś (University of North Carolina, USA)
Henryk Rybinski (Warsaw University of Technology, Poland)
Masashi Shimbo (NAIST, Japan)
Einoshin Suzuki (Yokohama National University, Japan)
Masahiro Terabe (MRI, Japan)
Ljupčo Todorovski (Jozef Stefan Institute, Slovenia)
Seiji Yamada (NII, Japan)
Yiyu Yao (University of Regina, Canada)
Kenichi Yoshida (University of Tsukuba, Japan)
Tetsuya Yoshida (Hokkaido University, Japan)
Stefan Wrobel (University of Magdeburg, Germany)
Ning Zhong (Maebashi Institute of Technology, Japan)
Table of Contents
Overview
Active Mining Project: Overview
Shusaku Tsumoto, Takahira Yamaguchi, Masayuki Numao,
Hiroshi Motoda . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Tutorial Papers
Computational and Statistical Methods in Bioinformatics
Tatsuya Akutsu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
called “information flood”, we can observe the following three serious problems
to be solved: (1) identification and collection of the relevant data from a huge
information search space, (2) mining useful knowledge from different forms of
massive data efficiently and effectively, and (3) promptly reacting to situation
changes and giving necessary feedback to both data collection and mining steps.
Due to these problems, organizations cannot be efficient without a sophisticated
framework for the data analysis cycle, dealing with data collection, data analysis
and user reaction.
Active mining has been proposed as a solution to these requirements, one that
collectively achieves the various mining needs. By “collectively achieving” we mean
that the total effect outperforms the simple additive effect that each individual
effort can bring. In particular, the active mining framework proposes a “spiral model”
of knowledge discovery rather than Fayyad’s KDD process [1] and focuses on the
feedback from domain experts in the cycle of data analysis (data mining).
In order to validate this framework, we have focused on common medical
and chemical datasets and developed techniques to realize this spiral model. As
a result, we have discovered knowledge that was unexpected to domain experts;
these results are included in these postproceedings.
each research goal, but could experience the total process of active mining due
to these meetings.
5 In This Volume
This volume gives a selection of the best papers presented at the Second Inter-
national Workshop on Active Mining (AM 2003), including papers not only from
research groups in the Active Mining Project but also from other data miners. This
section categorizes the papers from the project into three subgoals and summa-
rizes those papers and their contributions to our research goals.
5.1 Active Information Collection
Relevance Feedback Document Retrieval Using Support Vector Machines [2]. The
following data mining problems from document retrieval were investigated:
from a large data set of documents, we need to find documents that relate
Micro View and Macro View Approaches to Discovered Rule Filtering [3]. The
authors develop a discovered rule filtering method that filters the rules discovered
by a data mining system down to novel and reasonable ones by using information
retrieval techniques. In this method, they rank discovered rules according to the
results of information retrieval from an information source on the Internet. The
paper presents two approaches to discovered rule filtering: the micro view ap-
proach and the macro view approach. The micro view approach tries to retrieve
and show documents directly related to discovered rules. On the other hand,
the macro view approach tries to show the trend of research activities related
to discovered rules by using the results of information retrieval. They discuss
the advantages and disadvantages of the micro view approach and the feasibility
of the macro view approach, using an example of clinical data mining and MEDLINE
document retrieval.
Spiral Multi-Aspect Hepatitis Data Mining [8]. When giving therapy with IFN (in-
terferon) medication to chronic hepatitis patients, various kinds of conceptual knowl-
edge/rules are beneficial for treatment. The paper describes their work on
cooperatively using various data mining agents, including the GDT-RS inductive
learning system for discovering decision rules, LOI (learning with ordered
information) for discovering ordering rules and important features, and
POM (peculiarity oriented mining) for finding peculiarity data/rules, in a spi-
ral discovery process with multiple phases such as pre-processing, rule mining, and
post-processing, for multi-aspect analysis of the hepatitis data and meta-learn-
ing. Their methodology and experimental results show that the perspective of
medical doctors can be shifted from a single type of experimental data analysis
towards a holistic view by using this multi-aspect mining approach.
Sentence Role Identification in Medline Abstracts: Training Classifier with Struc-
tured Abstracts [9]. The abstract of a scientific paper typically consists of sen-
tences describing the background of study, its objective, experimental method
and results, and conclusions. The authors discuss the task of identifying which
of these “structural roles” each sentence in abstracts plays, with a particular
focus on its application in building a literature retrieval system. By annotat-
ing sentences in an abstract collection with role labels, we can build a litera-
ture retrieval system in which users can specify the roles of the sentences in
which query terms should be sought. They argue that this facility enables more
goal-oriented search, and also makes it easier to narrow down search results when
adding extra query terms does not work.
and capability to handle more complex data consisting of their relations. Nev-
ertheless, the bottleneck for learning first-order theories is the enormous hypothesis
search space, which causes inefficient performance of the existing learning ap-
proaches compared to propositional approaches. The authors introduce an
improved ILP approach capable of handling more efficiently a kind of data called
multiple-part data, i.e., data in which one instance consists of several parts as well as
relations among the parts. The approach tries to find a hypothesis describing the class of
each training example by using both individual and relational characteristics of
its parts, which is similar to finding common substructures among the complex
relational instances.
Cooperative Scenario Mining from Blood Test Data of Hepatitis B and C [14].
Chance discovery, i.e., discovering events significant for making a decision, can be
regarded as the emergence of a scenario, extracting events at the turning points
of valuable scenarios by means of communications in which participants exchange
the scenarios in their minds. The authors apply a method of chance discovery to
diagnostic data of hepatitis patients, in order to obtain scenarios of how the most
essential symptoms appear in patients with hepatitis of type B and C. In the
process of discovery, the results are evaluated to be novel and potentially useful
for treatment, under a mixture of objective facts and the subjective focus of
the hepatologists’ concerns. Hints of the relation between iron metabolism and
hepatitis cure, the effective conditions for using interferon, etc., were visualized.
Acknowledgment
This work was supported by the Grant-in-Aid for Scientific Research on Prior-
ity Areas (No.759) “Implementation of Active Mining in the Era of Information
Flood” by the Ministry of Education, Culture, Sports, Science and Technology
of Japan.
References
1. Fayyad, U., Piatetsky-Shapiro, G., Smyth, P.: The KDD process for extracting useful
knowledge from volumes of data. CACM 39 (1996) 27–34
2. Onoda, T., Murata, H., Yamada, S.: Relevance feedback document retrieval using
support vector machines. In Tsumoto, S., Yamaguchi, T., Numao, M., Motoda,
H., eds.: Active Mining. Volume 3430 of Lecture Notes in Artificial Intelligence,
Berlin, Springer (2005) 59–73
3. Kitamura, Y., Iida, A., Park, K.: Micro view and macro view approaches to dis-
covered rule filtering. In Tsumoto, S., Yamaguchi, T., Numao, M., Motoda, H.,
eds.: Active Mining. Volume 3430 of Lecture Notes in Artificial Intelligence, Berlin,
Springer (2005) 74–91
4. Geamsakul, W., Yoshida, T., Ohara, K., Motoda, H., Washio, T., Yokoi, H., Kat-
suhiko, T.: Extracting diagnostic knowledge from hepatitis dataset by decision
tree graph-based induction. In Tsumoto, S., Yamaguchi, T., Numao, M., Motoda,
H., eds.: Active Mining. Volume 3430 of Lecture Notes in Artificial Intelligence,
Berlin, Springer (2005) 128–154
5. Yada, K., Hamuro, Y., Katoh, N., Washio, T., Fusamoto, I., Fujishima, D., Ikeda,
T.: Data mining oriented crm systems based on musashi: C-musashi. In Tsumoto,
S., Yamaguchi, T., Numao, M., Motoda, H., eds.: Active Mining. Volume 3430 of
Lecture Notes in Artificial Intelligence, Berlin, Springer (2005) 155–176
6. Ohsaki, M., Kitaguchi, S., Yokoi, H., Yamaguchi, T.: Investigation of rule in-
terestingness in medical data mining. In Tsumoto, S., Yamaguchi, T., Numao,
M., Motoda, H., eds.: Active Mining. Volume 3430 of Lecture Notes in Artificial
Intelligence, Berlin, Springer (2005) 177–193
7. Yamada, Y., Suzuki, E., Yokoi, H., Takabayashi, K.: Experimental evaluation of
time-series decision tree. In Tsumoto, S., Yamaguchi, T., Numao, M., Motoda,
H., eds.: Active Mining. Volume 3430 of Lecture Notes in Artificial Intelligence,
Berlin, Springer (2005) 194–214
8. Ohshima, M., Okuno, T., Fujita, Y., Zhong, N., Dong, J., Yokoi, H.: Spiral multi-
aspect hepatitis data mining. In Tsumoto, S., Yamaguchi, T., Numao, M., Motoda,
H., eds.: Active Mining. Volume 3430 of Lecture Notes in Artificial Intelligence,
Berlin, Springer (2005) 215–241
9. Shimbo, M., Yamasaki, T., Matsumoto, Y.: Sentence role identification in medline
abstracts: Training classifier with structured abstracts. In Tsumoto, S., Yamaguchi,
T., Numao, M., Motoda, H., eds.: Active Mining. Volume 3430 of Lecture Notes in
Artificial Intelligence, Berlin, Springer (2005) 242–261
10. Hirano, S., Tsumoto, S.: Empirical comparison of clustering methods for long
time-series databases. In Tsumoto, S., Yamaguchi, T., Numao, M., Motoda, H.,
eds.: Active Mining. Volume 3430 of Lecture Notes in Artificial Intelligence, Berlin,
Springer (2005) 275–294
11. Okada, T., Yamakawa, M., Niitsuma, H.: Spiral mining using attributes from 3d
molecular structures. In Tsumoto, S., Yamaguchi, T., Numao, M., Motoda, H.,
eds.: Active Mining. Volume 3430 of Lecture Notes in Artificial Intelligence, Berlin,
Springer (2005) 295–310
12. Takahashi, Y., Nishikoori, K., Fujishima, S.: Classification of pharmacological
activity of drugs using support vector machine. In Tsumoto, S., Yamaguchi, T.,
Numao, M., Motoda, H., eds.: Active Mining. Volume 3430 of Lecture Notes in
Artificial Intelligence, Berlin, Springer (2005) 311–320
13. Nattee, C., Sinthupinyo, S., Numao, M., Okada, T.: Mining chemical compound
structure data using inductive logic programming. In Tsumoto, S., Yamaguchi,
T., Numao, M., Motoda, H., eds.: Active Mining. Volume 3430 of Lecture Notes
in Artificial Intelligence, Berlin, Springer (2005) 92–113
14. Ohsawa, Y., Fujie, H., Saiura, A., Okazaki, N., Matsumura, N.: Cooperative sce-
nario mining from blood test data of hepatitis b and c. In Tsumoto, S., Yamaguchi,
T., Numao, M., Motoda, H., eds.: Active Mining. Volume 3430 of Lecture Notes
in Artificial Intelligence, Berlin, Springer (2005) 321–344
Computational and Statistical Methods
in Bioinformatics
Tatsuya Akutsu
1 Introduction
Due to the progress of high-throughput experimental technology, the complete genome
sequences of many organisms have been determined. However, it remains unclear how
organisms are controlled by their genome sequences. Therefore, it is important to de-
cipher the meanings of genomes and genes. Though many experimental technologies
have been developed for that purpose, it is also important to develop informa-
tion technologies for analyzing genomic sequence data and related data (e.g.,
protein structures, gene expression patterns), because huge amounts of data are
being produced. One of the major goals of bioinformatics (and, almost equivalently,
computational biology) is to develop methods and software tools for supporting
such analysis.
In this paper, we overview computational and statistical methods developed
in bioinformatics. Since bioinformatics has become a very wide area, it is difficult
to give a comprehensive survey. Thus, we focus on a recent and important topic:
kernel methods in bioinformatics. Though kernel methods can be used in several
ways, the most studied way is with the support vector machine (SVM) [5, 6],
where an SVM is a machine learning algorithm based on statistical
learning theory. SVMs provide a good way to combine computational methods
with statistical methods. We show how computational and statistical methods
are combined using SVMs in bioinformatics. It is worth noting that we do
not intend to give a comprehensive survey of SVMs in bioinformatics. Instead,
we focus on SVMs for analyzing biological sequence data and chemical structure
data, and try to explain these in a self-contained manner so that readers not
familiar with bioinformatics (but familiar with computer science) can understand
the ideas and methods without reading other papers or books.
(This paper is based on a tutorial at the ISMIS 2003 conference.)
The organization of this paper is as follows. First, we overview sequence align-
ment, which is used to compare and search for similar biological sequences. Sequence
alignment is one of the most fundamental and important computational meth-
ods in bioinformatics. Next, we overview computational methods for the discovery
of common patterns from sequences with common properties. Then, we overview
the Hidden Markov Model (HMM) and its applications to bioinformatics, where
an HMM is a statistical model for generating sequences. Though HMMs were orig-
inally developed in other fields such as statistics and speech recognition, they
have been successfully applied to bioinformatics since the early 1990s. Then, we
overview the main topic of this paper: recent developments in kernel methods in
bioinformatics. In particular, we focus on kernel functions for measuring sim-
ilarities between biological sequences, because designing good kernel functions
is a key issue in applying SVMs to the analysis of biological sequences. We also
overview the marginalized graph kernel, which can be used for comparing chem-
ical structures. It should be noted that chemical structures are also important
in organisms, and understanding the interaction between chemical structures and
proteins is one of the key issues in bioinformatics. Finally, we conclude with other
applications of kernel methods in bioinformatics.
2 Sequence Alignment
Sequence alignment is a fundamental and important problem in bioinformatics
and is used to compare two or multiple sequences [7, 26, 31]. Sequence align-
ment is classified into two categories: pairwise sequence alignment and multiple
sequence alignment.
First, we briefly review pairwise sequence alignment. Let A be the set of
nucleic acids or the set of amino acids (i.e., |A| = 4 or |A| = 20). For each
sequence s over A, |s| denotes the length of s and s[i] denotes the i-th letter of s
(i.e., s = s[1]s[2] . . . s[n] if |s| = n). Then, a global alignment between sequences
s and t is obtained by inserting gap symbols (denoted by ‘-’) into or at either
end of s and t such that the resulting sequences s' and t' are of the same length l,
where it is not allowed for any i ≤ l that both s'[i] and t'[i] are gap symbols.
The score matrix f(a, b) is a function from A' × A' to the set of reals, where
A' = A ∪ {−}. We reasonably assume f(a, b) = f(b, a) and f(a, −) = f(−, b) = −d
(d > 0) for all a, b ∈ A, f(−, −) = 0 and f(a, a) > 0 for all a ∈ A. Then, the
score of an alignment (s', t') is defined by

$$\mathrm{score}(s', t') = \sum_{i=1}^{l} f(s'[i], t'[i]).$$
O(mn) time using dynamic programming procedures even if this affine gap cost
is employed [7]. The Smith-Waterman algorithm with affine gap cost is the most
widely used method for comparing two sequences.
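To illustrate the dynamic programming idea, the following is a minimal Python sketch of global pairwise alignment scoring with a linear gap penalty d. It is not the affine-gap Smith-Waterman variant discussed above, and the toy score values are assumptions for illustration only.

```python
# Minimal sketch: global pairwise alignment score by dynamic programming,
# with a linear gap penalty d (the affine-gap variant is not implemented here).

def global_alignment_score(s, t, f, d):
    """Maximum of sum_i f(s'[i], t'[i]) over all global alignments of s and t."""
    m, n = len(s), len(t)
    # F[i][j]: best score of aligning the prefixes s[:i] and t[:j]
    F = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = -d * i
    for j in range(1, n + 1):
        F[0][j] = -d * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            F[i][j] = max(F[i - 1][j - 1] + f[s[i - 1], t[j - 1]],  # substitution
                          F[i - 1][j] - d,                          # gap in t
                          F[i][j - 1] - d)                          # gap in s
    return F[m][n]

# Toy score matrix: +1 for a match, -1 for a mismatch, gap penalty d = 2
# (illustrative values only, not those used in the text).
f = {(a, b): (1.0 if a == b else -1.0) for a in "ACGT" for b in "ACGT"}
print(global_alignment_score("AGCCAGTG", "GCCGTGG", f, 2.0))
```

The double loop makes the O(mn) running time mentioned above directly visible.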
The above dynamic programming algorithms are fast enough to compare
two sequences. However, in the case of homology search (search for similar sequences),
pairwise alignment between the query sequence and all sequences in the database
must be performed. Since several hundred thousand sequences
are usually stored in the database, simple application of pairwise alignment
would take a lot of time. Therefore, several heuristic methods have been pro-
posed for fast homology search, among which FASTA and BLAST are widely
used [26]. Most heuristic methods employ the following strategy: candidate
sequences having fragments (short substrings) that are the same as (or
very similar to) a fragment of the query sequence are first searched, and then
pairwise alignments are computed using these fragments as anchors. Using these
methods, homology search against several hundred thousand sequences can be
done in several seconds.
Next, we briefly review multiple sequence alignment. In this case, more than
two sequences are given. Let s_1, s_2, . . . , s_h be the input sequences. As in pairwise
alignment, an alignment for s_1, s_2, . . . , s_h is obtained by inserting gap symbols
into or at either end of each s_i such that the resulting sequences s'_1, s'_2, . . . , s'_h are of the
same length l. For example, consider the three sequences AGCCAGTG, GCCGTGG,
and AGAGAGG. Then, the following are examples of alignments:
M8           M9           M10
AGCCAGTG-    AGCCAGT-G    AGCCAGT-G-
-GCC-GTGG    -GCC-GTGG    -GCC-GT-GG
AG--AGAGG    AG--AGAGG    -AGA-G-AGG
Though there are various scoring schemes for multiple alignment, SP-score
(Sum-of-Pairs score) is simple and widely used. SP-score is defined by

$$\mathrm{score}(s'_1, \cdots, s'_h) = \sum_{1 \le p < q \le h} \sum_{i=1}^{l} f(s'_p[i], s'_q[i]).$$

Using the score function defined before, the score of both M8 and M9 is 3, and
the score of M10 is -5. In this case, both M8 and M9 are optimal alignments.
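A minimal sketch of computing the SP-score of a given multiple alignment follows. The pairwise score values used here are illustrative assumptions; they do not reproduce the scores 3 and -5 quoted above, whose exact score function is not shown in this extract.

```python
# Minimal sketch: SP-score of a multiple alignment under a toy score function
# (match +1, mismatch -1, gap -d, gap-gap 0); values are illustrative only.

def pair_score(a, b, d=2.0):
    if a == '-' and b == '-':
        return 0.0
    if a == '-' or b == '-':
        return -d
    return 1.0 if a == b else -1.0

def sp_score(rows):
    """rows: aligned sequences of equal length, using '-' as the gap symbol."""
    l = len(rows[0])
    return sum(pair_score(rows[p][i], rows[q][i])
               for p in range(len(rows))
               for q in range(p + 1, len(rows))
               for i in range(l))

# Alignment M8 from the text
print(sp_score(["AGCCAGTG-", "-GCC-GTGG", "AG--AGAGG"]))
```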
The dynamic programming technique for pairwise alignment can be extended
to multiple alignment. However, it is not practical because it takes O(2^h · n^h) time [7],
where n is the maximum length of the input sequences. Indeed, multiple alignment
is known to be NP-hard if h is part of the input (i.e., h is not fixed) [36]. Thus,
a variety of heuristic methods have been applied to multiple alignment, which
include simulated annealing, evolutionary computation, iterative improvement,
branch-and-bound search, and stochastic methods [7, 29].
Among them, the progressive strategy is widely used [7, 34]. In this strategy, we
need alignment between two profiles, where a profile corresponds to the result of
an alignment. Alignment between profiles can be computed in a similar way as in
(Figure: an example of profile-profile alignment, in which two small profiles are merged into a single larger alignment.)
(i) Construct a distance matrix for all pairs of sequences by pairwise sequence
alignment, followed by conversion of alignment scores into distances using
an appropriate method.
(ii) Construct a rooted tree whose leaves correspond to input sequences, using
a method for phylogenetic tree construction.
(iii) Progressively perform sequence-sequence, sequence-profile and profile-profile
alignment at nodes in order of decreasing similarity.
Though we have assumed that score functions were given, derivation of score
functions is also important. Score functions are usually derived by taking log-
ratios of frequencies [11]. However, some other methods have been proposed for
optimizing score functions [12, 14].
Fig. 2. Motif detection using a profile (figure: a position weight matrix w(x, j) over {A, C, G, T} and three positions is applied to windows of the sequence ACTGAACATAG; windows whose total weight Σ_i w(x, i) (10.0 and 9.1 in the figure) exceeds the threshold θ = 9.0 are reported as motifs)
$$P(s, \pi \mid \Theta) = \prod_{i=1}^{n} a_{\pi[i-1]\pi[i]}\, e_{\pi[i]}(s[i]),$$
It is not difficult to see that the Viterbi algorithm works in O(nm^2) time. Once
the c_k(i)'s are computed, π*(s) can be obtained using the traceback technique as in
sequence alignment.
The Forward algorithm computes the probability that a given sequence is
generated. It computes

$$P(s \mid \Theta) = \sum_{\pi} P(s, \pi \mid \Theta)$$

when sequence s is given. This probability can also be computed using dynamic
programming. Let f_k(i) be the probability of emitting the prefix s[1 . . . i] of s and
reaching the state q_k. Then, the following is the core part of the Forward algo-
rithm:

$$f_k(i) = e_k(s[i]) \cdot \sum_{q_l \in Q} f_l(i-1) \cdot a_{lk}.$$
It should be noted that the only difference between the Forward algorithm and
the Viterbi algorithm is that ‘max’ in the Viterbi algorithm is replaced with ‘Σ’
in the Forward algorithm.
Here, we define the backward probability b_k(i) as the probability of being at
state q_k and emitting the suffix s[i+1 . . . n] of s. Then, the following is the core
part of the Backward algorithm:

$$b_k(i) = \sum_{q_l \in Q} e_l(s[i+1]) \cdot b_l(i+1) \cdot a_{kl}.$$
Using f_k(i) and b_k(i), the probability that the HMM is in state q_k at the i-th
step (i.e., just after emitting s[i]) is given by

$$P(q_{\pi[i]} = q_k \mid s) = \frac{P(s,\, q_{\pi[i]} = q_k)}{P(s)} = \frac{f_k(i) \cdot b_k(i)}{\sum_{q_{k'} \in Q} f_{k'}(|s|)}.$$
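The following sketch puts the Forward and Backward recurrences and this posterior probability together. The two-state toy HMM is an illustrative assumption, and the silent begin state q0 of the text is folded into the initial vector.

```python
# Minimal sketch: Forward/Backward recurrences and the posterior
# P(q_pi[i] = q_k | s). The two-state toy HMM below is an illustrative
# assumption; the silent begin state q0 is folded into the vector `init`.

def forward(s, init, a, e):
    m = len(init)
    f = [[0.0] * m for _ in s]
    for k in range(m):
        f[0][k] = init[k] * e[k][s[0]]
    for i in range(1, len(s)):
        for k in range(m):
            f[i][k] = e[k][s[i]] * sum(f[i - 1][l] * a[l][k] for l in range(m))
    return f

def backward(s, a, e):
    m = len(a)
    b = [[0.0] * m for _ in s]
    for k in range(m):
        b[len(s) - 1][k] = 1.0
    for i in range(len(s) - 2, -1, -1):
        for k in range(m):
            b[i][k] = sum(e[l][s[i + 1]] * b[i + 1][l] * a[k][l] for l in range(m))
    return b

def posterior(s, init, a, e):
    f, b = forward(s, init, a, e), backward(s, a, e)
    p_s = sum(f[len(s) - 1][k] for k in range(len(init)))   # P(s | Theta)
    return [[f[i][k] * b[i][k] / p_s for k in range(len(init))]
            for i in range(len(s))]

init = [0.5, 0.5]                      # illustrative initial probabilities
a = [[0.9, 0.1], [0.2, 0.8]]           # transition probabilities a[k][l]
e = [{'A': 0.3, 'C': 0.2, 'G': 0.1, 'T': 0.4},   # emission probabilities
     {'A': 0.1, 'C': 0.4, 'G': 0.3, 'T': 0.2}]
print(posterior("ACGT", init, a, e))
```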
Fig. 3. Example of an HMM (figure: a silent start state q0 and two emitting states q1 and q2 with transition probabilities a01, a02, a11, a12, a21, a22 and emission probabilities e1(A) = 0.3, e1(C) = 0.2, e1(G) = 0.1, e1(T) = 0.4, e2(A) = 0.1, e2(C) = 0.4, e2(G) = 0.3, e2(T) = 0.2)
$$E_k(b) = \sum_{j=1}^{h} \frac{1}{P(s_j \mid \Theta)} \sum_{i \,:\, s_j[i] = b} f_k^j(i)\, b_k^j(i),$$
Fig. 4. Computation of a multiple alignment using a profile HMM (figure: the Viterbi paths π*(s1), π*(s2), π*(s3) of sequences s1 = AGC, s2 = ACGC, s3 = AC through the match (M), insertion (I) and deletion (D) states between BEGIN and END determine the multiple alignment)
where f_k^j(i) and b_k^j(i) denote f_k(i) and b_k(i) for sequence s_j, respectively. Then,
we can obtain a new set of parameters â_kl and ê_k(b) by

$$\hat{a}_{kl} = \frac{A_{kl}}{\sum_{q_{l'} \in Q} A_{kl'}}, \qquad \hat{e}_k(b) = \frac{E_k(b)}{\sum_{b' \in A} E_k(b')}.$$
It is proved that this iterative procedure does not decrease the likelihood [7].
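Given expected counts A_kl and E_k(b) accumulated with the Forward/Backward probabilities, the re-estimation step above amounts to a row-wise normalization; a minimal sketch with hypothetical count arrays is:

```python
# Minimal sketch of the re-estimation step: normalize the expected transition
# counts A[k][l] and emission counts E[k][b] row by row. The counts themselves
# would be accumulated with the Forward/Backward probabilities.

def reestimate(A, E):
    a_hat = [[A[k][l] / sum(A[k]) for l in range(len(A[k]))]
             for k in range(len(A))]
    e_hat = [{b: E[k][b] / sum(E[k].values()) for b in E[k]}
             for k in range(len(E))]
    return a_hat, e_hat

# hypothetical expected counts for a two-state model over {A, C, G, T}
A = [[8.0, 2.0], [3.0, 7.0]]
E = [{'A': 4.0, 'C': 1.0, 'G': 1.0, 'T': 4.0},
     {'A': 1.0, 'C': 4.0, 'G': 3.0, 'T': 2.0}]
print(reestimate(A, E))
```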
HMMs are applied to bioinformatics in various ways. One common way is the
use of profile HMMs. Recall that a profile is a function w(x, j) from A × [1 . . . L]
to R, where L denotes the length of a motif region. Given a sequence s, the
score for s was defined by Σ_{j=1,...,L} w(s[j], j). Though profiles are useful for
detecting short motifs, they are not so useful for detecting long motifs or remote
homologs (sequences having weak similarities) because insertions and deletions
are not allowed. A profile HMM is considered to be an extension of a profile
such that insertions and deletions are allowed.
A profile HMM has a special architecture as shown in Fig. 4. The states are clas-
sified into three types: match states, insertion states and deletion states. A match
state corresponds to one position in a profile. A symbol b is emitted from a match
state qi according to probability ei (b). A symbol b is emitted with probability P (b)
from any insertion state qi , where P (b) is the background frequency of occurrence
of the symbol b. No symbol is emitted from any deletion state. Using a profile
HMM, we can also obtain multiple alignment of sequences by combining π ∗ (sj )
for all input sequences sj . Though alignments obtained by profile HMMs are not
necessarily optimal, these are meaningful from a biological viewpoint [7].
Using profile HMMs, we can classify sequences. Though there are various
criteria for sequence classification, one of the well-known criteria is classification
Fig. 5. Classification of protein sequences using profile HMMs (figure: an HMM is trained by EM on the positive sequences of each class, and a new sequence s is scored against each HMM with the Forward or Viterbi algorithm; P(s | HMM1) = 0.2, P(s | HMM2) = 0.3). In this case, a new sequence s is predicted to belong to Class 2 because the score for HMM2 is greater than that for HMM1
Though only the HMM and the profile HMM are explained in this section, a lot
of variants and extensions of HMMs have also been developed and applied to
various problems in bioinformatics. For example, a special kind of HMM has
been applied to finding gene-coding regions in DNA sequences, and stochastic
context-free grammars (an extension of HMMs) have been applied to the prediction
of RNA secondary structures [7].
5 String Kernel
Support vector machines (SVMs) have been widely applied to various problems in
artificial intelligence, pattern recognition and bioinformatics since the SVM was
proposed by Cortes and Vapnik in the 1990s [5]. In this section, we overview methods
for applying SVMs to the classification of biological sequences. In particular, we
overview kernel functions (to be explained below) for biological sequences.
Fig. 6. An SVM finds the hyperplane h with the maximum margin γ such that positive examples (denoted by circles) are separated from negative examples (denoted by crosses). In this case, a new example denoted by a white square (resp. black square) is inferred as positive (resp. negative)
Fig. 7. Feature map Φ from the sequence space X to the d-dimensional Euclidean space R^d. Φ maps each sequence in X to a point in R^d
$$\Phi_{(k,m)}(s) = \sum_{i=1}^{|s|-k+1} \phi_{(k,m)}(s[i \ldots i+k-1]).$$

It should be noted that Φ_(k,0) coincides with the feature map of the spectrum
kernel.
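As a minimal sketch of the spectrum kernel (the m = 0 case of the feature map above), each sequence is mapped to its vector of k-mer counts and the kernel is the inner product of two such vectors; the sequences below are illustrative.

```python
# Minimal sketch of the spectrum kernel: count the k-mers of each sequence and
# take the inner product of the two count vectors (mismatches, i.e. m > 0,
# are not handled here).

from collections import Counter

def spectrum(s, k):
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))

def spectrum_kernel(s, t, k):
    phi_s, phi_t = spectrum(s, k), spectrum(t, k)
    return sum(count * phi_t[kmer] for kmer, count in phi_s.items())

print(spectrum_kernel("AAGCTAAT", "AAGCTGAT", 3))
```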
There is another way to extend the spectrum kernel. The motif kernel was
proposed using a set of motifs in place of a set of short substrings [2]. Let M
be a set of motifs and occ_M(m, s) denote the number of occurrences of motif
m ∈ M in sequence s. Then, we define the feature map Φ_M(s) by

$$\Phi_M(s) = (\mathrm{occ}_M(m, s))_{m \in M}.$$

Since the number of motifs may be very large, an efficient method is employed
for searching motifs [2].
5.3 SVM-Pairwise
As mentioned in Section 2, the score obtained by the Smith-Waterman algo-
rithm (SW-score) is widely used as a standard measure of similarity between
two protein sequences. Therefore, it is reasonable to develop kernels based on
the SW-score (i.e., max_{i,j} F[i][j]). Liao and Noble proposed a simple method to
derive a feature vector using the SW-score [23]. For two sequences s and s', let
sw(s, s') denote the SW-score between s and s'. Let S = {s_1, . . . , s_n} be a set
of sequences used as training data, where some sequences are used as positive
examples and the others as negative examples. For each sequence s (s may not
be in S), we define a feature vector Φ_SW(s) by

$$\Phi_{SW}(s) = (sw(s, s_1),\, sw(s, s_2),\, \ldots,\, sw(s, s_n)).$$

Then, the kernel is simply defined as Φ_SW(s) · Φ_SW(s'). In order to apply this
kernel to the analysis of real sequence data, some normalization procedures are re-
quired. Details of the normalization procedures are given in [23].
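A minimal sketch of this construction follows; sw_score is a stand-in for a real Smith-Waterman implementation (not given here), and the normalization procedures of [23] are omitted.

```python
# Minimal sketch of SVM-pairwise: the feature vector of a sequence is its
# vector of SW-scores against the training sequences, and the kernel is the
# inner product of two such vectors. `sw_score` is a placeholder for a real
# Smith-Waterman implementation; no normalization is applied.

def pairwise_feature(s, training_seqs, sw_score):
    return [sw_score(s, si) for si in training_seqs]

def pairwise_kernel(s, t, training_seqs, sw_score):
    phi_s = pairwise_feature(s, training_seqs, sw_score)
    phi_t = pairwise_feature(t, training_seqs, sw_score)
    return sum(x * y for x, y in zip(phi_s, phi_t))
```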
Using the Gaussian RBF kernel [6], the SVM-Fisher kernel is defined as
(Fig. 8, figure: a state path π = 1 2 2 1 2 2 2 for the sequence s = ACGGTTA, together with the counts of (symbol, state) pairs, e.g. (A,1) = 1, (A,2) = 1, (T,2) = 2.)
where c_aq(z) denotes the frequency of occurrences of a pair (a, q) in z. A vector
of the c_aq(z) (i.e., (c_aq(z))_{(a,q) ∈ A×Q}) is considered as a feature vector Φ(z). This
c_aq(z) can be rewritten as

$$c_{aq}(z) = \frac{1}{k} \sum_{i=1}^{k} \delta(s[i], a)\, \delta(q_{\pi[i]}, q),$$
where the summation is taken over all possible state sequences and k = |s| can
be different from k' = |s'|. This kernel is rewritten as

$$K(s, s') = \sum_{a \in A} \sum_{q \in Q} \gamma_{aq}(s)\, \gamma_{aq}(s'),$$
where γ_aq(s) denotes the marginalized count (see Fig. 8) defined by

$$\gamma_{aq}(s) = \frac{1}{k} \sum_{\pi} P(\pi \mid s) \sum_{i=1}^{k} \delta(s[i], a)\, \delta(q_{\pi[i]}, q)
= \frac{1}{k} \sum_{i=1}^{k} P(q_{\pi[i]} = q \mid s)\, \delta(s[i], a).$$
It should be noted that linearity of expectations is used to derive the last equality.
As described in Section 4, P (qπ[i] = q|s) can be computed using the Forward
and Backward algorithms.
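Using those posterior probabilities (for instance as returned by the posterior function sketched in Section 4), the marginalized counts and the resulting kernel can be computed as in the following sketch; the alphabet and state sets are assumptions.

```python
# Minimal sketch of the marginalized count kernel: gamma_aq(s) is the
# posterior-weighted frequency of emitting symbol a in state q, and the kernel
# is the inner product over all (a, q) pairs. `post[i][q]` stands for the
# posterior P(q_pi[i] = q | s) computed by the Forward/Backward algorithms.

def marginalized_counts(s, post, alphabet, n_states):
    k = len(s)
    gamma = {(a, q): 0.0 for a in alphabet for q in range(n_states)}
    for i in range(k):
        for q in range(n_states):
            gamma[(s[i], q)] += post[i][q] / k
    return gamma

def marginalized_count_kernel(gamma_s, gamma_t):
    return sum(gamma_s[key] * gamma_t[key] for key in gamma_s)
```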
$$\mathrm{score}(s', t') = \sum_{i=1}^{l} f(s'[i], t'[i]).$$

Let Π(s, t) be the set of all possible local alignments between sequences s and
t. We define the local alignment kernel K_LA^(β)(s, t) by

$$K_{LA}^{(\beta)}(s, t) = \sum_{(s', t') \in \Pi(s,t)} \exp(\beta \cdot \mathrm{score}(s', t')).$$

$$sw(s, t) = \max_{(s', t') \in \Pi(s,t)} \mathrm{score}(s', t')
= \frac{1}{\beta} \log\Bigl( \max_{(s', t') \in \Pi(s,t)} \exp(\beta \cdot \mathrm{score}(s', t')) \Bigr).$$
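The relation between the local alignment kernel and the SW-score can be illustrated numerically: the kernel replaces the max over alignments by a sum, and (1/β) log of that sum approaches the SW-score as β grows. The alignment scores used below are illustrative assumptions.

```python
# Small numeric illustration: (1/beta) * log(sum over alignments of
# exp(beta * score)) approaches the maximum score, i.e. the SW-score,
# as beta increases. The scores below are toy values.
import math

scores = [3.0, 2.0, -1.0]           # scores of all local alignments (toy)
for beta in (0.1, 1.0, 10.0, 100.0):
    soft = math.log(sum(math.exp(beta * x) for x in scores)) / beta
    print(beta, round(soft, 4))     # tends to max(scores) = 3.0
```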
(Fig. 9, figure: schematic of feature vector computation. In SVM-pairwise, the feature vector Φ(s) of an input sequence s is built from SW-alignment scores against the positive and negative training sequences; in the SVM-Fisher approach, an HMM is trained by EM on the positive training sequences and Φ(s) is obtained via the Forward/Backward algorithms.)
Kernels in class (i) (especially the spectrum kernel) have the merit that kernel val-
ues can be computed very efficiently. Kernels in classes (ii) and (iv) require longer
time (especially in the training phase) because the dynamic programming algorithms take
quadratic time. In order to compute kernel values in class (iii), it is required to
train HMMs, for which the Baum-Welch algorithm is usually employed.
Therefore, the performance may depend on how the HMMs are trained.
These kernels, except the marginalized count kernel, were tested on
the benchmark data set for detecting remote homology at the SCOP superfamily
level [2, 13, 21, 23, 32]. In order to apply kernel methods to remote homology
detection, we need some method for combining the results from multiple
SVMs, since an SVM can classify only two classes (i.e., positive and negative).
For that purpose, there are two approaches: (a) simply selecting the class for
which the corresponding SVM outputs the highest score, and (b) using majority
votes from multiple SVMs. In the case of remote homology detection, the former
approach (a) is usually used [2, 13, 21, 23, 32]. In this approach, we first construct
an SVM for each class, where sequences in the class are used as positive examples
and sequences not in the class are used as negative examples. Then, we apply a
test sequence to each SVM, and the class whose SVM gives the highest score is selected.
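A minimal sketch of approach (a): the test sequence is assigned to the class whose SVM gives the highest decision score. The score values below are hypothetical stand-ins for the outputs of trained per-class SVMs.

```python
# Minimal sketch of approach (a): one SVM per class, and the predicted class
# is the one with the highest decision score. The scores below are hypothetical.

svm_scores = {"superfamily_A": -0.4, "superfamily_B": 1.3, "superfamily_C": 0.2}
predicted = max(svm_scores, key=svm_scores.get)
print(predicted)   # "superfamily_B"
```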
The results of experiments with the benchmark data set suggest that kernel
based methods are in general better than a simple HMM search or a simple
homology search using the SW-score. The results also suggest the following re-
lation:
6 Graph Kernel
Inference of the properties of chemical compounds is becoming important in bioin-
formatics, as well as in other fields, because it has potential applications to drug
design. In order to process chemical compounds in computers, chemical com-
pounds are usually represented as labeled graphs. Therefore, it is reasonable to
develop kernels that measure similarities between labeled graphs. However, to my
knowledge, only a few such kernels have been developed. In this section, we review
the marginalized graph kernel [8, 16], which is a marginalized kernel for graphs.
It should be noted that other types of graph kernels [8, 17] and some extensions
of the marginalized graph kernel [24] have been proposed.
A labeled graph G = (V, E) is defined by a set of vertices V, a set of undirected
edges E ⊆ V × V, and a labeling function l from V to A, where A is the set
of atom types (i.e., A = {H, O, C, N, P, S, Cl, . . .}). Though we only consider labels
of vertices for simplicity in this paper, the methods can be extended to cases
where labels of edges are taken into account. For a graph G, V^k denotes the set of
Fig. 10. Example probability distribution on a graph, where (1 − P_q(v)) P_a(u|v) is shown for each solid arrow. For π = u1 u2 u3, l(π) = (H,C,O) and P(π) = 0.25 · 0.9 · 0.3 · 0.1. For π' = u2 u4 u2 u3, l(π') = (C,Cl,O,C) and P(π') = 0.25 · 0.3 · 0.9 · 0.3 · 0.1
sequences of k vertices, and V* denotes the set of all finite sequences of vertices
(i.e., V* = ∪_{k=1}^{∞} V^k). The labeling function l can be extended to sequences
of vertices by letting l(π) = (l(π[1]), l(π[2]), . . . , l(π[k])) for each element π =
π[1]π[2] . . . π[k] ∈ V^k. If π ∈ V^k satisfies the property that (π[i], π[i+1]) ∈ E
for all i = 1, . . . , k − 1, π is called a path.
Here we consider probability distributions on paths (see Fig. 10). For each
π ∈ V*, we define P(π) by

$$P(\pi) = P_0(\pi[1]) \cdot \prod_{i=2}^{k} P_a(\pi[i] \mid \pi[i-1]) \cdot P_q(\pi[k]),$$

where P_0(v), P_a(u|v) and P_q(v) denote an initial probability distribution on V
(Σ_{v∈V} P_0(v) = 1), a transition probability distribution on V × V (Σ_{u∈V} P_a(u|v) = 1
for each v ∈ V), and a stopping probability distribution on V, respectively.
For (v, u) ∉ E, P_a(u|v) = 0 must hold. Letting P_s(v) = P_0(v) P_q(v) and
P_t(u|v) = (1 − P_q(v)) P_a(u|v) P_q(u) / P_q(v), we can rewrite P(π) as

$$P(\pi) = P_s(\pi[1]) \prod_{i=2}^{k} P_t(\pi[i] \mid \pi[i-1]).$$
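A minimal sketch of evaluating P(π) in this rewritten form follows; the probability tables used are illustrative assumptions, not the values of Fig. 10.

```python
# Minimal sketch: probability of a vertex path pi under the rewritten form
# P(pi) = P_s(pi[1]) * prod_{i>=2} P_t(pi[i] | pi[i-1]).
# The toy tables below are illustrative assumptions.

def path_probability(pi, P_s, P_t):
    p = P_s[pi[0]]
    for prev, cur in zip(pi, pi[1:]):
        p *= P_t.get((cur, prev), 0.0)   # P_t(u | v), zero for non-edges
    return p

P_s = {"u1": 0.05, "u2": 0.05, "u3": 0.05, "u4": 0.05}
P_t = {("u2", "u1"): 0.6, ("u3", "u2"): 0.3, ("u4", "u2"): 0.3, ("u2", "u4"): 0.6}
print(path_probability(["u1", "u2", "u3"], P_s, P_t))
```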
Fig. 11. Example of a pair of graphs G and G'. For paths π = u1 u2 u3 of G and π' = v1 v2 v5 of G', l(π) = l(π') = (H,C,O) holds, from which K_Z(l(π), l(π')) = 1 follows. For paths π = u1 u2 u3 u2 u4 of G and π' = v1 v2 v5 v2 v1 of G', l(π) = (H,C,O,C,Cl) and l(π') = (H,C,O,C,H) hold, from which K_Z(l(π), l(π')) = 0 follows
where G and G' correspond to x and x', respectively, and are omitted in the right-
hand side of this formula.
The above definition involves a summation over an infinite number of paths.
However, it can be computed efficiently using matrix inversions if K_Z satis-
fies some property [8]. Here, we assume that K_Z is the Dirac kernel. Let π =
u_1 u_2 . . . u_k and π' = v_1 v_2 . . . v_k be paths of G and G' such that K_Z(l(π), l(π')) = 1.
Let Π be the set of such pairs of paths. We define γ(π, π') by

$$\gamma(\pi, \pi') = P(\pi) \cdot P(\pi').$$

Then, we have

$$K(G, G') = \sum_{(\pi, \pi') \in \Pi} \gamma(\pi, \pi').$$
If we define the |V| · |V'|-dimensional vector γ_s = (γ_s(v, v'))_{(v,v') ∈ V×V'} and the
(|V|·|V'|) × (|V|·|V'|) transition matrix Γ_t = (γ_t((u, u') | (v, v')))_{(u,u',v,v') ∈ V×V'×V×V'},
we have:

$$\sum_{(\pi, \pi') \in \Pi,\ |\pi| = |\pi'| = k} \gamma(\pi, \pi') = \gamma_s^{T} (\Gamma_t)^{k-1} \mathbf{1}.$$
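Summing the right-hand side above over all path lengths k ≥ 1 gives a geometric series that can be evaluated with a single matrix inversion, which is the "matrix inversion" computation mentioned earlier. A minimal sketch, with a small illustrative γ_s and Γ_t and assuming the series converges, is:

```python
# Minimal sketch: summing gamma_s^T (Gamma_t)^(k-1) 1 over k >= 1 equals
# gamma_s^T (I - Gamma_t)^(-1) 1 when the geometric series converges.
# The small vector and matrix below are illustrative assumptions.
import numpy as np

gamma_s = np.array([0.05, 0.02, 0.03])           # gamma_s(v, v'), flattened
Gamma_t = np.array([[0.1, 0.2, 0.0],
                    [0.0, 0.1, 0.3],
                    [0.2, 0.0, 0.1]])            # gamma_t((u, u') | (v, v'))
ones = np.ones(3)

kernel_value = gamma_s @ np.linalg.solve(np.eye(3) - Gamma_t, ones)
print(kernel_value)
```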
7 Conclusion
We have overviewed fundamental computational and statistical methods in bioinfor-
matics and recent advances in kernel methods for bioinformatics. Kernel methods
have been a very active and exciting field in the last few years. As seen in this
paper, a variety of kernel functions have been developed for analyzing sequence data
within a few years, in which fundamental techniques in bioinformatics were efficiently
combined with support vector machines.
Though we focused on string kernels and graph kernels, kernel methods have
been applied to other problems in bioinformatics, which include analysis of
gene expression data [19, 27], analysis of phylogenetic profiles [39], analysis of
metabolic pathways [40], prediction of protein subcellular locations [30], pre-
diction of protein secondary structures [37], classification of G-protein coupled
receptors [15], and inference of protein-protein interactions [3, 10]. Though we
focused on SVMs, kernel functions can also be used with other methods, including
principal component analysis (PCA) and canonical correlation analysis
(CCA) [6, 40]. More details and other applications of kernel methods in bioinfor-
matics can be found in [33].
Huge amounts of various types of biological data are being produced every
day. Therefore, it is necessary to develop more efficient and flexible methods for
extracting the important information behind these data. Kernel methods will be one
of the key technologies for that purpose. More studies should be done on the
development of other kernel functions as well as on the improvement of support vector
machines and other kernel-based methods.
Acknowledgements
This work was partially supported by a Japanese-French SAKURA grant named
“Statistical and Combinatorial Analysis of Biological Networks”. The author
would like to thank Jean-Philippe Vert and other members of that project,
through which he learned much about kernel methods.
References
1. Bailey, T. L. and Elkan, C., Fitting a mixture model by expectation maximization
to discover motifs in biopolymers, Proc. Second International Conf. on Intelligent
Systems for Molecular Biology, 28-36, 1994.
2. Ben-Hur, A. and Brutlag, D., Remote homology detection: a motif based approach,
Bioinformatics, 19:i26-i33, 2003.
3. Bock, J. R. and Gough, D. A., Predicting protein-protein interactions from primary
structure, Bioinformatics, 17:455-460, 2001.
4. Brazma, A., Jonassen, I., Eidhammer, I. and Gilbert, D., Approaches to the au-
tomatic discovery of patterns in biosequences, Journal of Computational Biology,
5:279-305, 1998.
5. Cortes, C. and Vapnik, V., Support vector networks, Machine Learning, 20:273-
297, 1995.
6. Cristianini, N. and Shawe-Taylor, J., An Introduction to Support Vector Machines
and Other Kernel-Based Learning Methods, Cambridge Univ. Press, 2000.
7. Durbin, R., Eddy, S., Krogh, A. and Mitchison, G., Biological Sequence Analysis.
Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press,
1998.
8. Gärtner, T., Flach, P. and Wrobel, S., On graph kernels: Hardness results and
efficient alternatives, Proc. 16th Annual Conf. Computational Learning Theory
(LNAI 2777, Springer), 129-143, 2003.
9. Haussler, D., Convolution kernels on discrete structures, Technical Report, UC
Santa Cruz, 1999.
10. Hayashida, M., Ueda, N. and Akutsu, T., Inferring strengths of protein-protein
interactions from experimental data using linear programming, Bioinformatics,
19:ii58-ii65. 2003.
11. Henikoff, A. and Henikoff, J. G., Amino acid substitution matrices from protein
blocks, Proc. Natl. Acad. Sci. USA, 89:10915-10919, 1992.
12. Hourai, Y., Akutsu, T. and Akiyama, Y., Optimizing substitution matrices by
separating score distributions, Bioinformatics, 20:863-873, 2004.
13. Jaakkola, T., Diekhans, M. and Haussler, D., A discriminative framework for de-
tecting remote protein homologies, Journal of Computational Biology, 7:95-114,
2000.
14. Kann, M., Qian, B. and Goldstein, R. A., Optimization of a new score function
for the detection of remote homologs, Proteins: Structure, Function, and Genetics,
41:498-503, 2000.
15. Karchin, R., Karplus, K. and Haussler, D., Classifying G-protein coupled receptors
with support vector machines, Bioinformatics, 18:147-159, 2002.
16. Kashima, H., Tsuda, K. and Inokuchi, A., Marginalized kernels between labeled
graphs, Proc. 20th Int. Conf. Machine Learning, 321-328, AAAI Press, 2003.
17. Kondor, R. I. and Lafferty. J. D., Diffusion kernels on graphs and other discrete
input spaces, Proc. 19th Int. Conf. Machine Learning, 315-322, AAAI Press, 2002.
18. Lawrence, C. E., Altschul, S. F., Boguski, M. S., Liu, J. S., Neuwald, A. F. and
Wootton, J. C., Detecting subtle sequence signals: a Gibbs sampling strategy for
multiple alignment, Science, 262:208-214, 1993.
19. Lee, Y. and Lee, C-K., Classification of multiple cancer types by multicategory
support vector machines using gene expression data, Bioinformatics, 19:1132-1139,
2003.
20. Leslie, C., Eskin, E. and Noble, W. E., The spectrum kernel: a string kernel for
svm protein classification, Proc. Pacific Symp. Biocomputing 2002, 564-575, 2002.
21. Leslie, C., Eskin, E., Cohen, A., Weston, J. and Noble, W. E., Mismatch string
kernels for discriminative protein classification, Bioinformatics, 20:467-476, 2004.
22. Levitt, M., Gerstein, M., Huang, E., Subbiah, S. and Tsai, J., Protein folding:
The endgame, Annual Review of Biochemistry, 66:549-579, 1997.
23. Liao, L. and Noble, W. S., Combining pairwise sequence similarity and support
vector machines for detecting remote protein evolutionary and structural relation-
ships, Journal of Computational Biology, 10:857-868, 2003.
24. Mahé, P., Ueda, N., Akutsu, T., Perret, J-L. and Vert, J-P., Extensions of marginal-
ized graph kernels, Proc. 21st Int. Conf. Machine Learning, 552-559, AAAI Press,
2004.
25. Moult, J., Fidelis, K., Zemla, A. and Hubbard, T., Critical assessment of methods
for protein structure prediction (CASP)-round V, Proteins: Structure, Function,
and Genetics, 53, 334-339, 2003.
26. Mount, D. W., Bioinformatics: Sequence and Genome Analysis, Cold Spring Har-
bor Laboratory Press, 2001.
27. Mukherjee, S. et al., Estimating dataset size requirements for classifying DNA
microarray data, Journal of Computational Biology, 10:119-142, 2003.
28. Murzin, A. G. et al., SCOP: A structural classification of proteins database for the
investigation of sequences and structures, Journal of Molecular Biology, 247:536-
540, 1995.
29. Notredame, C., Recent progresses in multiple sequence alignment: A survey, Phar-
macogenomics, 3:131-144, 2002.
30. Park, K-J. and Kanehisa, M., Prediction of protein subcellular locations by sup-
port vector machines using compositions of amino acids and amino acid pairs,
Bioinformatics, 19:1656-1663, 2003.
31. Pevzner, P. A. Computational Molecular Biology. An Algorithmic Approach, The
MIT Press, 2000.
32. Saigo, H., Vert, J-P., Ueda, N. and Akutsu, T., Protein homology detection using
string alignment kernels, Bioinformatics, 20:1682-1689, 2004.
33. Schölkopf, B, Tsuda, K. and Vert, J-P., Kernel Methods in Computational Biology,
MIT Press, 2004.
34. Thompson, J., Higgins, D. and Gibson, T., CLUSTAL W: Improving the sensitivity
of progressive multiple sequence alignment through sequence weighting position-
specific gap penalties and weight matrix choice, Nucleic Acids Research, 22:4673-
4680, 1994.
35. Tsuda, K., Kin, T. and Asai, K., Marginalized kernels for biological sequences,
Bioinformatics, 18:S268-S275, 2002.
36. Wang, L. and Jiang, T., On the complexity of multiple sequence alignment, Journal
of Computational Biology, 1:337–348, 1994.
37. Ward, J. J., McGuffin, L. J., Buxton, B. F. and Jones, D. T., Secondary structure
prediction with support vector machines, Bioinformatics, 19:1650-1655, 2003.
38. Watkins, C., Dynamic alignment kernels, Advances in Large Margin Classifiers,
39-50, MA, MIT Press, 2000.
39. Vert, J-P., A tree kernel to analyse phylogenetic profiles, Bioinformatics, 18:S276-
S284, 2002.
40. Yamanishi, Y., Vert, J-P., Nakaya, A. and Kanehisa, M., Extraction of correlated
gene clusters from multiple genomic data by generalized kernel canonical correla-
tion analysis, Bioinformatics, 19:i323-i330, 2003.
Indexing and Mining Audiovisual Data
1 Introduction
1.1 Motivation
Indexing documents has been an important task ever since we began
to collect documents in libraries and other administrative offices. It has become
a technological challenge with the increase in document production that has
accompanied the development of computers. It is now of strategic importance for com-
panies, administrations, education and research centres to be able to handle and
manage the morass of documents being produced each day, worldwide. Docu-
ment structuring languages such as XML, search engines, and other automatic
indexing engines have been the subject of much research and development
in recent years. We will not discuss the needs of general document indexing
any further. We will focus on the more specialised domain of
audio-visual documents.
Audio-visual documents call for several specific actions: archiving, legal de-
posit, restoration, and the provision of easy access [5, 6, 7].
Table 2. Legal deposit (professional archives are partially included in legal deposit)
(source INA 2003)
Hours of TV Hours of Radio
430 000 hours 500 000 hours
For the task of archiving, almost every country has an institute, an adminis-
tration or a company that is in charge of recording, registering and storing the
flow of radio and TV programs broadcast each day. Table 1 gives
the figures for the Professional Archives of the Audio-visual National Institute
(INA, France). They correspond to a total of 1.1 million hours of radio and TV
broadcasting, or approximately 70 000 hours a year of audio and audio-visual
programs (for the French broadcasting systems) since 1995.
The legal deposit task corresponds to the information concerning broadcast
content that is provided by the radio and TV broadcasting companies. This information
is used, for example, to pay the corresponding rights to the professionals
concerned (authors, actors, composers, etc.). The figures for the legal deposit at INA
are given in Table 2.
It corresponds to a total of 930 000 hours of broadcasting, registered on 2.5
million documents covering 113 km of shelf space (8 km per year). It would take
133 years to watch or listen to all the archives.
Audio-visual documents have been stored on physical supports, which
may degrade or be destroyed over time. It is therefore necessary to develop technologies,
and to devote considerable effort, to maintain and even restore what belongs to our cultural
patrimony.
Storing is of no use if we cannot access what has been stored. There is a need
to access audio-visual documents simply to re-view old documents, to build
a new production from previous ones, or to develop specific studies. Accessing
the right audio-visual document has become a real challenge in the morass of
already stored documents. This is the most demanding area for the development of new
and efficient technologies in audio-visual indexing and mining.
retrieval media (the screen, or the printed paper). A device has come in
between to rebuild the reading document from the stored document. The core
of the problems linked to the publication of content lies at the foundation of
the digital document. For this reason, the outcome of the evolution from
the paper-based document to the digitised document corresponds to the evolu-
tion from classical indexing for information retrieval to indexing for electronic
publication.
Before the era of the digitised document, each medium was limited to a dedicated
support, without interaction with other media. Audio-visual information,
recorded on a magnetic tape or on silver film, was not linked to any other in-
formation (texts, graphs, . . . ). Conversely, it was difficult to mix text with good-
quality pictures (separate pages in a book), and even more so text with audio-visual material.
When they are transformed into digitised data, media can be mixed much more easily.
Besides technical problems, this new state of affairs raises new fundamental difficulties
concerning the writing and reading of multimedia: how can different media be integrated
in a writing process when it is dedicated to a reading usage? In this context,
indexing becomes an instrument for reading. Rather than information retrieval,
it is aimed at organising information for its reading.
Audio and audio-visual objects impose their rhythm and reading sequence;
they are temporal objects. They are built in a duration dimension instead of a
spatial dimension, as a text would be. They call for revisiting what indexing is.
In the analogue context, audio-visual indexing corresponds to the identification
of a document (cataloguing) and a global description of its content (what it is
about), without any finer level of description. This level is not needed, simply be-
cause there is no means of reading a portion of an analogue record without reading
the whole record. The digitised record allows random access to the content, which
calls for revisiting the indexing issues.
So we come to three major issues:
• What is a document? We will define the different types of documents, and
we will introduce the notion of meta-data.
• What is indexing? We will define three types of indexing: conceptual index-
ing, document indexing, and content-based indexing.
• What are the possibilities of numeric multimedia? What are the difficulties
raised by the indexing of temporal objects, images, or audio-visual flows? What are
the different levels of indexing, according to its nature and the index (can
we index a sound by a word, for example)?
2 Document
2.1 What Is a Document?
Definition: a document is a content instituted by a publication act, written
down on a medium, which possesses a spatial and temporal delimitation.
The words used in this definition call for some more precise explanation of
their meaning in the scope of this paper.
screen), which can be viewed by the viewer. It can be summed up by the following
scheme:
magnetic signal (recording pattern) → video-tape (recording medium) → TV
set (appropriation medium) → visual signal (colour pixels on the screen).
• The semiotic appropriation form: the presentation displayed on the appro-
priation medium follows a structure or a scheme such that it can be directly
understood by the user. Consequently, this is the appropriation
form, which allows the user to appropriate the content. The appropriation
form corresponds to a form that can be interpreted by the user to the ex-
tent that it belongs to an already known semiotic register. If a document
calls for several semiotic appropriation forms, we say that the document
is a multimedia document. Audio-visual is multimedia since it calls up im-
ages, music, noise and speech. An image can also be multimedia if it contains
texts and iconic structures.
• The appropriation modality: the semiotic appropriation form concerns one or
several perceptive modalities. If there are several perceptive modalities, the
document is multi-modal. Audio-visual documents are multi-modal, since
they call on both sight and hearing.
1. the recording medium and the appropriation medium are the same: the
medium used for reading is the same as the medium used for storing,
2. the recording form and the appropriation form are the same: what we read
is what has been put on the medium.
The paper-based document became a special case when we had to consider an-
other kind of document, the temporal documents. In this case we must distin-
guish the different semiotic appropriation forms:
• Static and spatial appropriation forms: all the interpretable structures are
presented to the reader at the same time. The way (the order) of reading
is not imposed on the user. Even if the structure of the document suggests
a canonical way of reading, it is not a necessary condition of reading.
• Temporal and dynamic appropriation forms: the interpretable structures are
presented successively to the user, according to a rhythm imposed by the
document. The order and the rhythm of the document constitute the meaning
of the document.
This second case corresponds to audio and audio-visual documents. They record
a temporal flow to preserve and to re-build its meaning. However, this kind of
document raises the following problems:
1. the recording form must be material (hardware), thus it must be spatial. This is
the only way to preserve the temporal information, since it cannot be stored in
the content.
2. the semiotic form is temporal, which means that temporality is intrinsic
to the understandability of the document. The document must carry the
temporality by itself.
From this, we can deduce that the recording medium does not equal the appro-
priation medium, and that the recording form does not equal the appropriation
form.
Consequently, it is mandatory to have a process to rebuild the temporal form
of a temporal document from its record. This process must be a mechanical one,
that is to say, a process to re-build a temporal flow from a static and material
ordering of components. For example, the magnetic signal of a video tape, which
is static, drives the video tape-recorder or the TV set to rebuild a physical
appropriation form. The semiotic appropriation form can then merge with the
physical appropriation form.
Thus we must distinguish the recording form from the content, because it
is not self-readable. It is not meant to be read, but to be played by some
mechanical means, which will rebuild the temporal form of the document. The
recorded form can be accessed only by means of a reading device, a player,
which can decode this form to rebuild the document.
[Figure: the recording form (coding) stored on the recording medium is decoded into its counterpart on the restitution medium]
dimensions linked to the record characterize the resources to rebuild the
document rather than the document itself. So we come to the conclusion that a
temporal document exists only during its temporal progress, and that the recording
medium is not a document: it is a resource. A movie is a movie only at the time
when it is viewed in the cinema or broadcast on a TV channel. This is not a
paradox, but simply the conclusion of the temporal nature of the document. The
movie recorded on a silver film, or the audio-visual document registered on the
video tape, are not documents. They are tools and resources that we can use to
access the document.
The viewed document is always the result of a (computed) transformation.
What we consult is not what is recorded, but the output of its transformation
by a computation. Thus a document is linked to the computing process which
rebuilds it and to the information which parameterises and drives the rebuilding. This
information describes what is necessary for matching the rebuilding with the
viewing objective. It is thus meta-data, information about the document, which
makes it useful for a given rebuilding. Meta-data make it possible to use and
exploit information. They are a generalisation of the index and information retrieval
concepts.
Finally, meta-data must be created simultaneously with the document, and
not afterwards like indexes in general. They are integrated into the content, such
that we can say that meta-data are data.
3 Indexing
3.1 What Is Indexing in General ?
According to the AFNOR standard, indexing is:
Definition: Describing and characterising a document by representing the
concepts contained in the document, and recording these concepts in some
organised file which can be easily accessed. Files are organised as paper or
digitised sheets, indices, tables, catalogues, etc., and queries are handled through
these files.
This definition concerns textual documents; it must be generalised. For this
generalisation, we will introduce later the notion of descriptors. First let us try to
define indexing in its general acceptation, outside the context of any computer-
based process. The main objective of indexing is to make information accessible
by means of an index. Indexing is the process by which the content of a
document is analysed in order to be expressed in a form which allows content
accessing and handling. The word indexing denotes both the process and its
result. Thus indexing is the description of a document for the purpose of a given
use.
Indexing can typically be divided into two steps:
• A conceptual analysis step: the document is analysed to extract concepts.
• A documentary expression step: the concepts are translated into a documentary
language with tools such as thesauri, classifications, etc.
This last step is important. The type of index is determined by the type of
document handling which is pursued. Typically the main handling considered
is information retrieval: to find where the searched information is and to ex-
tract the corresponding documents from the document collection. Indexing has
two purposes: on the one hand, it must be exploitable to determine where
the searched information is, and on the other hand it must make it possible to fetch
this information. In a library, for example, each book is referred to by a category
determining its content and a shelf mark determining its place on a given shelf.
Besides, index cards contain the descriptions of all the books. When a user comes with
a given query, the librarian refers to the cards to find out which
books correspond to the query, and where they are. Indexing is used to find
the relevant information (the books) and its localisation (the shelf). Index
cards have been built for easy handling by the librarian.
So, to build an efficient indexing, we must first define the level of accuracy
which is needed for the application. Accuracy is determined by two characteris-
tics: firstly, it corresponds to the richness of the concepts and descriptors used
for indexing; secondly, it corresponds to the precision of information localisation
in the described documents. The first one is called the acuteness of the description
and the second one is called the granularity of the description.
Acuteness depends on the desired faithfulness of the search results. It is
defined by:
the other hand, the modification of one unit may not modify the interpretation
of the document. For example, a given pixel is a digital unit which has an ar-
bitrary link with the image to which it belongs. Moreover, the modification of
a pixel will not significantly affect the image. Five wrong pixels will not modify
the meaning of an image output on a screen.
Digitising the semiotic support means that a link between each unit and the
content interpretation has been defined. The link may be arbitrary, in the sense
that its linguistic meaning is arbitrary, but it is systematic. The modification
of one unit will modify the meaning. For example, the alphabet characters are
discrete units of the written form. Although they have an arbitrary link with the
meaning, the modification of one alphabetic character may modify the meaning
of a document. What would be the value of a screen which would modify five
alphabetic characters on each page?
We will discuss the multimedia aspects of digitising later on; however, we can
consider now the example of the MPEG compression and digitising standards for
audiovisual documents. MPEG-1 and MPEG-2 specify compression and digitis-
ing standards based on the pixel unit, exploiting the redundancy of the informa-
tion linked to each pixel. This information concerns luminance and chrominance,
and it has no link with the interpretation of the documents. On the contrary,
the MPEG-4 standard analyses the audiovisual flow as a set of objects linked
by spatio-temporal relations. For example, the television screen background is a
distinct object of the broadcast flow. From a certain point of view, this object
is a significant sign, and can be used accordingly.
Whatever aspect of digitising is concerned, be it physical or semiotic, we
can define any needed level of granularity for the digitised manipulation
unit. This possibility tremendously increases the complexity of indexing: the ma-
nipulation unit was implicit in the traditional case (the book or the document).
Digitising requires explicitly defining the part of the document which corresponds
to the searched information.
Indices are not only structured according to the logical and conceptual rela-
tions, which exist between the descriptors that they use. They are also linked by
the relations, which link the described parts of a content. Indices can be based
on markup languages for digitised documents.
Fig. 2. What does this image show: a storm, a criminal attack, a riot?
• Queries are built with words (strings built with alphabetic characters).
• The system performs a comparison with patterns (strings, words, . . . ) in the
text, separated by spaces or punctuation characters.
Of course, this process may be improved to lower the noise (documents which have no
relationship with the query, mainly because the context has changed
or reduced the meanings of the words of the query), and to reduce the silence (by
expanding the query with words having the same meaning as those in the query, in
the concerned context).
This approach can be extended by analogy to audio-visual documents. Thus,
for audiovisual document content indexing we must make the following assump-
tions:
• Queries built with sounds or images make it possible to find similar doc-
uments based on a distance measurement between images or sounds. The
information retrieval is based on a similarity process.
• We can index a document content with image-indices or sound-indices.
• Iconic descriptors.
• Key-frames.
• Descriptors of the digital signal (texture, histograms, . . . )
• Descriptors computed from descriptors of the digital signal.
Finally, indexing can be decomposed into three steps, which represent the
functions that can be assigned to indexing :
4 Multimedia Indexing
We have introduced above the notion of a multimedia document as a document
calling for several semiotic appropriation forms. Audio-visual is multimedia since
it calls up images, music, noise and speech. Such a document mobilizes several
semiotic appropriation forms. In addition, if it involves several perceptive modal-
ities, the document is multi-modal. Audio-visual documents are multi-modal,
since they call for eyesight and hearing. We have also seen that one
consequence of digitizing is that it allows us to consider as many semiotic units
as necessary. With multimedia we must take into account new objects which
modify the indexing problem: the temporal objects.
• The integration of the documentation and the document production line: the pro-
duction of a temporal object is complex, and thus is realised in distinct
steps, each of which contains its own documentation: production (scripts,
story-boards, . . . ), broadcasting (TV magazines, TV conductors, . . . ), archiv-
ing (notes, annexes, . . . ). Computerisation allows the integration of all
documents on the same support. It also allows the corresponding informa-
tion to be exchanged along the audio-visual object's life cycle. The content is built
up, broadcast, and archived carrying along its documentation, which is
enriched and adapted along its life cycle.
• Document and documentation link: segments of the document and its doc-
umentation can be linked, thanks to the digital support.
Fig. 4. The document information is not in the document, and its organisation is not
that of the described document
• Integration of the document and the documentation : the document and the
documentation are on the same support and the documentation is mandatory
for exploiting the document.
access to multimedia data. The standard defines three kinds of information that
comprise the description:
• Features: they concern all that we need to express for describing the docu-
ment (authors, scenes, segments, . . . ),
• Descriptors: they are the formal representation of a feature,
• Description schemes: they correspond to structures linking different
descriptors or other description schemes. To a certain extent, they are an
adaptation to MPEG-7 of the notion of DTD of XML, or of Schema of
XML-Schema,
• The Description Definition Language (DDL), to create descriptors and de-
scription schemes. It is an extension of XML-Schema.
• sound effects,
• the timbre of musical instruments: the scheme describes the percep-
tive properties with a reduced number of descriptors, such as “richness”,
“attack”, . . .
• speech: it is described by a combination of sounds and words, which
makes it possible to find words outside a given vocabulary through the associated
sounds,
• melodies: the scheme is designed to allow query by analogy, especially
with an aria whistled or hummed by the user,
• low-level descriptions (temporal envelope, spectral representation, harmon-
ics, . . . ).
A “silence” descriptor makes it possible to describe silent content.
<!-- ################################################ -->
<!-- Definition of Transition DS                      -->
<!-- ################################################ -->
<complexType name="TransitionType">
  <complexContent>
    <extension base="mpeg7:VideoSegmentType">
      <sequence>
        <element name="GradualEvolution"
                 type="mpeg7:GradualEvolutionType"
                 minOccurs="0"/>
        <element name="SpatioTemporalLocator"
                 type="mpeg7:SpatioTemporalLocatorType"
                 minOccurs="0"/>
      </sequence>
      <attribute name="editingLevel" use="optional">
        <simpleType base="string">
          <enumeration value="global"/> <!-- or InterShot -->
          <enumeration value="composition"/>
6 Current Approaches
We will present three current approaches to audio-visual document indexing. The
first refers to the classical text mining approach. Although it does not bring new
concepts to indexing, it remains the most developed approach at the present
time, mainly due to its efficiency and its ease of use. We will briefly
present some recent approaches, without going into detailed descriptions.
Other approaches use spatio-temporal descriptions and object-oriented models of
multimedia databases. We invite the reader to consult the bibliographical references
given at the end of the paper [16, 17, 18, 19, 27, 28, 29, 30, 31, 32]. The second
approach is mainly based on the XML Schema tools, which is a very promising
approach. The last one is based on a prototype which has been developed at
INA.
Let N be the total number of documents and idf_i the number of documents
in which the string i appears.

tr_i = log( (N − idf_i) / idf_i )

P2_{i,j} = (1 + log tf_{i,j}) · log( N / df_i )

where df_i is the number of documents in which the word w_i appears. P1 represents
the relevance of representing a document D_j by the word w_i, and P2 represents
the selectivity of the word w_i in the collection of documents. We usually compute
a combination of P1 and P2 to take advantage of both characteristics.
This basic notion is completed by the statistical distribution of strings in the
documents or by linguistic approaches to compute the weight of strings.
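As an illustration of these weights, the following minimal Python sketch computes tr_i and P2_{i,j} for a toy collection. The toy documents and function names are hypothetical, P1 is not shown here because its definition does not appear in this excerpt, and a real system would of course work on the full term-document statistics.

import math
from collections import Counter

# Toy collection: each document is a list of word tokens (hypothetical data).
docs = [
    ["hepatitis", "virus", "liver", "virus"],
    ["liver", "biopsy", "results"],
    ["hepatitis", "vaccine", "trial"],
]
N = len(docs)

# Number of documents in which each word appears (the idf_i / df_i counts of the text).
df = Counter(word for d in docs for word in set(d))

def tr(word):
    """tr_i = log((N - idf_i) / idf_i)."""
    return math.log((N - df[word]) / df[word])

def p2(word, doc):
    """P2_{i,j} = (1 + log tf_{i,j}) * log(N / df_i), for a word occurring in doc."""
    tf = doc.count(word)
    return (1 + math.log(tf)) * math.log(N / df[word])

print(tr("vaccine"))            # selectivity of "vaccine" in the collection
print(p2("virus", docs[0]))     # weight of "virus" in the first document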
[Figure: the index built from the transcription associates each word word_i with the documents D1 . . . Dn (weights f_i,j) and with its context: surrounding words, document, temporal position in the document, ontology; notices and annexes are indexed as well]
Fig. 7. Words and documents are organised on the maps according to their respective
distance
• xsd:element: They define the name and type of each XML element. For
example, the statement
<xsd:element name="AudioVisual" type="AudioVisualDS" />
defines that XML files may contain an element named AudioVisual with a
complex content, as described elsewhere in the XML-Schema.
• xsd:attribute: They describe the name and type of the attributes of an XML el-
ement. They may contain an attribute use with the value required, which
states that the attribute is mandatory for this XML element. They can
be of a simple type (integer, float, string, etc.) or a complex type (e.g. enumerations,
numerical ranges, etc.). For example, in the following statements, the first
xsd:attribute definition states that the TrackLeft attribute may take an inte-
ger value, whereas the second xsd:attribute, named AVType, may take a value
whose format is described elsewhere in the XML-Schema.
<xsd:attribute name="TrackLeft" type="xsd:integer" use="required"/>
<xsd:attribute name="AVType" type="AVTypeD" use="required"/>
• xsd:simpleType: They define new datatypes that can be used for XML at-
tributes. For example :
<xsd:simpleType name = "AVTypeD" >
<xsd:restriction base = "string">
<xsd:enumeration value = "Movie" />
<xsd:enumeration value = "Picture" />
</xsd:restriction>
</xsd:simpleType>
• xsd:attributeGroup: They group xsd:attribute definitions that are used by
many XML elements.
• xsd:complexType: They represent the various entities of the metadata model.
They contain:
• one or more <xsd:attribute> tags
• one or more <xsd:element> tags
Here follows the complete definition of the internal structure of the Audio-
visual XML element :
<xsd:element name="AudioVisual type=" AudioVisualDS " />
<xsd:complexType name="AudioVisualDS">
<xsd:attribute name="id" type=" ID use="required />
<xsd:attribute name="AVType" type="AVTypeD" use="required"/>
<xsd:sequence>
<xsd:element maxOccurs="1" minOccurs="0"name="Syntactic" type="SyntacticDS" />
<xsd:element maxOccurs="1" minOccurs="0"name="Semantic" type="SemanticDS" />
<xsd:element maxOccurs="1" minOccurs="0" ref="MetaInfoDS" />
</xsd:sequence> </xsd:complexType>
<xsd:simpleType name="AVTypeD">
<xsd:restriction base="string">
Indexing and Mining Audiovisual Data 55
• has two attributes (“id”, which is an identifier, and “AVType”, which can take
one of the values Movie, Picture, or Document),
• may contain two sub-elements, namely Syntactic and Semantic (with their
own sub-elements),
• may contain a reference to a MetaInfoDS element.
The XML Schema file contains the definition of the various entities, some
supporting structures, and the description of the structure of the database
commands.
The extension elements are:
• DBCommand
• DBInsert, DBUpdate
• DBDelete
• DBReply
• DBSelect
Finally, for mapping the XML-Schema into a relational schema, the XML-
Schema file that contains information about the structure of the exchanged XML
documents is used to generate the relational database structures. Then, XML
documents are parsed to construct the appropriate SQL commands.
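As an illustration of this last step, the following minimal Python sketch parses an XML document shaped like the AudioVisual element above and builds an SQL INSERT command. The table and column names (audiovisual, av_type) are hypothetical assumptions; a complete mapping generated from the XML-Schema file would also handle nested elements and the DB* command wrappers.

import xml.etree.ElementTree as ET

# Hypothetical exchanged XML document following the AudioVisual structure above.
xml_doc = """<AudioVisual id="42" AVType="Movie">
                 <Syntactic/>
                 <Semantic/>
             </AudioVisual>"""

def audiovisual_to_sql(xml_text):
    """Parse one AudioVisual element and build the corresponding SQL INSERT command."""
    root = ET.fromstring(xml_text)
    av_id = int(root.attrib["id"])
    av_type = root.attrib["AVType"]
    # Table and column names are assumptions; in practice they are derived from the schema,
    # and parameterised queries would be used instead of string formatting.
    return "INSERT INTO audiovisual (id, av_type) VALUES (%d, '%s');" % (av_id, av_type)

print(audiovisual_to_sql(xml_doc))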
7 Conclusion
We have seen that the main outcome of audio-visual document digitizing is
the integration of all documents linked to a video production line on the same
support.
The main consequence is a revisiting of the definition of indexing, which
generalizes the text indexing definition. By breaking the document's integrity and
putting all documents on the same support, it becomes possible to define indices
linked to the semantic objects which compose an audio-visual document. The
most important property of these objects is that they can be temporal objects.
The development of markup languages has opened the possibility of devel-
oping indexing based on a semantic description of documents. It has also made it pos-
sible to propose effective computer-based systems which can implement actual
applications.
References
[1] Gwendal Auffret, Jean Carrive, Olivier Chevet, Thomas Dechilly, Remi Ronfard,
and Bruno Bachimont, Audiovisual-based Hypermedia Authoring - using struc-
tured representations for efficient access to AV documents, in ACM Hypertext,
1999
[2] G. Akrivas, S. Ioannou, E. Karakoulakis, K. Karpouzis,Y. Avrithis, A.Delopoulos,
S. Kollias, M. Vazirgiannis, I.Varlamis. An Intelligent System for Retrieval and
Mining of Audiovisual Material Based on the MPEG-7 Description Schemes. Eu-
ropean Symposium on Intelligent Technologies, Hybrid Systems and their imple-
mentation on Smart Adaptive Systems (EUNITE), Spain, 2001.
[3] Bachimont B. et Dechilly T., Ontologies pour l’indexation conceptuelle et struc-
turelle de l’audiovisuel, Eyrolles, dans Ingénierie des connaissances 1999-2001.
[4] Bachimont B., Isaac A., Troncy R., Semantic Commitment for Designing On-
tologies : a Proposal. 13th International Conference EKAW’02, 1-4 octobre 2002,
Siguenza, Espagne
[21] J. Ontrup and H. Ritter, Hyperbolic Self-Organizing Maps for Semantic Naviga-
tion, Advances in Neural Information Processing Systems 14, 2001
[22] E. Pampalk, A. Rauber, and D. Merkl, Content-based Organization and Visual-
ization of Music Archives, Proceedings of the ACM Multimedia 2002, pp 570-579,
Juan les Pins, France, December 1-6, 2002.
[23] E. Pampalk, A. Rauber, D. Merkl, Using Smoothed Data Histograms for Cluster
Visualization in Self-Organizing Maps, Proceedings of the Intl Conf on Artificial
Neural Networks (ICANN 2002), pp. 871-876, August 27.-30. 2002, Madrid, Spain
[24] I. Varlamis, M. Vazirgiannis, P. Poulos. Using XML as a medium for describing,
modifying and querying audiovisual content stored in relational database systems,
International Workshop on Very Low Bitrate Video Coding (VLBV). Athens 2001.
[25] I. Varlamis, M. Vazirgiannis. Bridging XML-Schema and relational databases. A
system for generating and manipulating relational databases using valid XML
documents, in the proceedings of ACM Symposium on Document Engineering,
Nov. 2001, Atlanta, USA
[26] I. Varlamis, M. Vazirgiannis. Web document searching using enhanced hyperlink
semantics based on XML, proceedings of IDEAS ’01, Grenoble, France.
[27] M. Vazirgiannis D. Tsirikos, Th.Markousis, M. Trafalis, Y. Stamati, M. Hat-
zopoulos, T. Sellis. Interactive Multimedia Documents: a Modeling, Authoring
and Rendering approach, in Multimedia Tools and Applications Journal (Kluwer
Academic Publishers), 2000.
[28] M. Vazirgiannis, Y. Theodoridis, T. Sellis. Spatio-Temporal Composition and In-
dexing for Large Multimedia Applications, in ACM/Springer-Verlag Multimedia
Systems Journal, vol. 6(4), 1998.
[29] M. Vazirgiannis, C. Mourlas. An object Oriented Model for Interactive Multime-
dia Applications, The Computer Journal, British Computer Society, vol. 36(1),
1/1993.
[30] M. Vazirgiannis. Multimedia Data Base Object and Application Modeling Issues
and an Object Oriented Model, in Multimedia Database Systems: Design and
Implementation Strategies, Kluwer Academic Publishers, 1996, Pages: 208-250.
[31] E.Veneau, R.Ronfard, P.Bouthemy, From Video Shot Clustering to Sequence
Segmentation, Fifteenth International Conference on Pattern Recognition
(ICPR’2000), Barcelona, september 2000.
[32] D. Vodislav, M. Vazirgiannis. Structured Interactive Animation for Multimedia
Documents, Proceedings of IEEE Visual Languages Symposium, Seattle, USA,
September 2000.
Relevance Feedback Document Retrieval Using
Support Vector Machines
1 Introduction
As Internet technology progresses, the information accessible to end users is in-
creasing explosively. In this situation, we can now easily access huge document
databases through the WWW. However, it is hard for a user to retrieve rele-
vant documents from which he/she can obtain useful information, and a lot of
studies have been done in information retrieval, especially document retrieval [1].
Active work on such document retrieval has been reported in TREC (Text Re-
trieval Conference) [2] for English documents, and in IREX (Information Retrieval and
Extraction Exercise) [3] and NTCIR (NII-NACSIS Test Collection for Informa-
tion Retrieval Systems) [4] for Japanese documents.
In most frameworks for information retrieval, a Vector Space Model (called
VSM), in which a document is described with a high-dimensional vector,
is used [5]. An information retrieval system using a vector space model computes
the similarity between the query vector and the document vectors by the cosine of the two
vectors and presents the user with a list of retrieved documents.
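A minimal sketch of this cosine ranking, assuming documents and the query have already been encoded as weight vectors (the toy vectors below are hypothetical):

import numpy as np

def cosine(a, b):
    """Cosine of the angle between a query vector and a document vector."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.0, 2.0])                 # toy query vector
doc_vectors = np.array([[0.5, 0.0, 1.0],          # toy document vectors, one per row
                        [0.0, 3.0, 0.0],
                        [1.0, 1.0, 1.0]])

scores = [cosine(query, d) for d in doc_vectors]
ranking = np.argsort(scores)[::-1]                # document indices, best first
print([int(i) for i in ranking], scores)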
In general, since a user can hardly describe a precise query in the first trial,
an interactive approach is taken in which the query vector is modified based on the user's
evaluation of the documents in a list of retrieved documents. This method is called relevance
feedback [6] and is widely used in information retrieval systems. In this method,
a user directly evaluates whether a document is relevant or irrelevant in a list
of retrieved documents, and the system modifies the query vector using the user's
evaluation. A traditional way to modify a query vector is a simple learning rule
that reduces the difference between the query vector and the documents evaluated as
relevant by the user.
In another approach, relevant and irrelevant document vectors are consid-
ered as positive and negative examples, and relevance feedback is transposed into
a binary classification problem [7]. Okabe and Yamada [7] proposed a framework
in which relational learning of classification rules was applied to interactive doc-
ument retrieval. Since the learned classification rules are described in a symbolic
representation, they are readable to humans and we can easily modify the
rules directly using a sort of editor. For binary classification problems, Sup-
port Vector Machines (called SVMs) have shown excellent ability,
and some studies have applied SVMs to text classification problems [8] and
information retrieval problems [9].
Recently, we proposed a relevance feedback framework with SVMs as ac-
tive learning and showed the usefulness of the proposed method experimentally [10].
Now, we are interested in which representation, boolean, TF, or TFIDF, is the most
efficient for document retrieval performance and learning performance,
and in what is the most useful rule for selecting the documents displayed at each
iteration. In this paper, we adopt several representations of the Vector Space
Model (VSM) and several rules for selecting the displayed documents at each
iteration, and then show comparison results of the retrieval effectiveness in these
several situations.
In the remaining parts of this paper, we briefly explain the SVM algorithm in the
second section. Active learning with SVMs for relevance feedback,
the adopted VSM representations, and the rules for selecting displayed documents
are described in the third section. In the fourth section, in order to compare
the effectiveness of the adopted representations and selection rules, we present
our experiments using a TREC data set of the Los Angeles Times and discuss the
experimental results. Finally, we conclude our work in the fifth section.
same source (probability density function) as the unseen test data. This concerns the experimental setup.
Second, the size of the class of functions from which we choose our estimate
f, the so-called capacity of the learning machine, has to be properly restricted
according to statistical learning theory [11]. If the capacity is too small, complex
discriminant functions cannot be approximated sufficiently well by any selectable
function f in the chosen class of functions: the learning machine is too simple to
learn well. On the other hand, if the capacity is too large, the learning machine
bears the risk of overfitting. In neural network training, overfitting is avoided by
early stopping, regularization or asymptotic model selection [12, 13, 14, 15].
For SV learning machines that implement linear discriminant functions in
feature spaces, the capacity limitation corresponds to finding a large margin
separation between the two classes. The margin is the minimal distance of the
training points (x_1, y_1), . . . , (x_ℓ, y_ℓ), x_i ∈ R^N, y_i ∈ {±1} to the separation surface,
i.e. the margin equals min_{i=1,...,ℓ} ρ(z_i, f), where z_i = (x_i, y_i) and ρ(z_i, f) = y_i f(x_i), and f is
the linear discriminant function in some feature space

f(x) = (w · Φ(x)) + b = Σ_{i=1}^ℓ α_i y_i (Φ(x_i) · Φ(x)) + b,        (1)

with w expressed as w = Σ_{i=1}^ℓ α_i y_i Φ(x_i). The quantity Φ denotes the mapping
from the input space X into a feature space F, Φ : X → F (see Figure 1); the data
could be transformed explicitly, but SVMs can do so implicitly. In order to
train and classify, all that SVMs use are dot products of pairs of data points
Φ(x), Φ(x_i) ∈ F in feature space (cf. Eq. (1)). Thus, we need only to supply a
so-called kernel function that can compute these dot products. A kernel function
k allows us to implicitly define the feature space (Mercer's Theorem, e.g. [16]) via,
for example, the polynomial kernel
k(x, x_i) = (κ (x · x_i) + Θ)^d ,
Fig. 1. A binary classification toy problem: the task is to separate black circles
from crosses. The shaded region consists of training examples, the other regions of test
data. The training data can be separated with a margin indicated by the slim dashed
line and the upper fat dashed line, implying the slim solid line as discriminant func-
tion. Misclassifying one training example (a circled white circle) leads to a considerable
extension (arrows) of the margin (fat dashed and solid lines), and this fat solid line can
classify two test examples (circled black circles) correctly
Note that there is no need to use or know the form of Φ, because the mapping
is never performed explicitly; the introduction of Φ in the explanation above was
for purely didactical, not algorithmic, purposes. Therefore, we can compu-
tationally afford to work in implicitly very large (e.g. 10^10-dimensional) feature
spaces. SVMs avoid overfitting by controlling the capacity and maximizing
the margin. Simultaneously, SVMs learn which of the features implied by the
kernel k are distinctive for the two classes, i.e. instead of finding well-suited fea-
tures by ourselves (which can often be difficult), we can use the SVM to select
them from an extremely rich feature space.
With respect to good generalization, it is often profitable to misclassify some
outlying training data points in order to achieve a larger margin between the
other training points (see Figure 1 for an example). This soft-margin strategy can
also learn non-separable data. The trade-off between the margin size and the number of
misclassified training points is then controlled by the regularization parameter C
(softness of the margin), which leads to the following quadratic program (QP) (see e.g. [11, 17]):

min ||w||^2 + C Σ_{i=1}^ℓ ξ_i
s.t. ρ(z_i, f) ≥ 1 − ξ_i and ξ_i ≥ 0 for all 1 ≤ i ≤ ℓ
Fig. 2. Image of the relevance feedback document retrieval: the gray arrow parts
(the iterative part, i.e. the relevance feedback) are carried out iteratively to retrieve
useful documents for the user. This iteration is called the feedback iteration in the
information retrieval research area
min ||w||^2                                  (4)
s.t. ρ(z_i, f) ≥ 1 for all 1 ≤ i ≤ ℓ
Fig. 3. The figure shows a discriminant function for classifying relevant or irrelevant
documents: circles denote documents which have been checked as relevant or irrelevant
by a user. The solid line denotes the discriminant function (with normal vector w);
the margin area lies between the dotted lines, bordering the relevant documents area
using the cosine distance between the request query vector and each docu-
ment vector for the first feedback iteration.
Step 2: Judgment of documents:
The user then classifies these N documents into relevant or irrelevant.
The relevant documents and the irrelevant documents are labeled. For
instance, the relevant documents have a "+1" label and the irrelevant doc-
uments have a "-1" label after the user's judgment.
Step 3: Determination of the optimal hyperplane:
The optimal hyperplane for classifying relevant and irrelevant documents
is determined by an SVM which is learned from the labeled documents (see
Figure 3).
Step 4: Discrimination of documents and information retrieval:
The documents which were retrieved in Step 1 are mapped into the
feature space. The SVM learned in the previous step classifies the doc-
uments as relevant or irrelevant. Then the system selects documents
based on the distance from the optimal hyperplane and on whether they lie in
the margin area. The details of the selection rules are described in the
next section. From the selected documents, the top N ranked documents,
ranked using the distance from the optimal hyperplane, are
shown to the user as the information retrieval results of the system. If the
number of feedback iterations is more than m, then go to the next step;
otherwise, return to Step 2. Here m is the maximal number of feedback
iterations and is given by the user or the system.
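The loop of Steps 1–4 can be sketched as follows with an off-the-shelf SVM. The use of scikit-learn's SVC, the random toy document vectors, and the simulated user judgments are assumptions made for illustration only, not the authors' implementation; the margin-based filtering anticipates selection rule 1 described in the next section.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
docs = rng.random((500, 50))                 # toy VSM document vectors (hypothetical)
query = rng.random(50)

# Step 1: initial ranking by cosine similarity to the query vector.
cos = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
shown = np.argsort(cos)[::-1][:20]           # top N = 20 documents

labels = {}                                  # document index -> +1 (relevant) / -1 (irrelevant)
for iteration in range(3):                   # m = 3 feedback iterations
    # Step 2: the user judges the displayed documents (simulated at random here).
    for i in shown:
        labels.setdefault(int(i), int(rng.choice([+1, -1])))

    idx = np.array(sorted(labels))
    y = [labels[i] for i in idx]
    if len(set(y)) < 2:                      # the SVM needs both classes to be trained
        break

    # Step 3: learn the separating hyperplane from the labelled documents.
    svm = SVC(kernel="linear").fit(docs[idx], y)

    # Step 4: classify the remaining documents; keep those predicted relevant that
    # lie inside the margin area, ranked by distance from the hyperplane.
    rest = np.setdiff1d(np.arange(len(docs)), idx)
    dist = svm.decision_function(docs[rest])
    mask = (dist > 0) & (dist < 1)
    candidates, cand_dist = rest[mask], dist[mask]
    if len(candidates) == 0:
        break
    shown = candidates[np.argsort(cand_dist)[::-1]][:20]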
Fig. 4. The figure shows the displayed documents as the result of document retrieval:
boxes denote non-checked documents which are mapped into the feature space, and
circles denote checked documents which are mapped into the feature space. The system
displays the documents represented by black circles and boxes to the user as the result
of document retrieval
Fig. 5. Non-checked documents mapped into the feature space: boxes denote non-
checked documents which are mapped into the feature space, and circles denote checked
documents. Black and gray boxes are documents in the margin area. The documents
represented by black boxes are shown to the user for the next iteration; these documents
are in the margin area and near the relevant documents area
Selection Rule 1:
The retrieved documents are mapped into the feature space. The learned
SVM classifies the documents as relevant or irrelevant. The documents
which are discriminated as relevant and lie in the margin area of the SVM are
selected. From the selected documents, the top N ranked documents,
ranked using the distance from the optimal hyperplane, are
displayed to the user as the information retrieval results of the system (see
Figure 5). This rule is expected to achieve the most effective retrieval
while keeping the learning performance. This rule is the one we propose for
relevance feedback document retrieval.
Selection Rule 2:
The retrieved documents are mapped into the feature space. The learned
SVM classifies the documents as relevant or irrelevant. The documents,
Fig. 6. Non-checked documents mapped into the feature space: boxes denote non-
checked documents which are mapped into the feature space, and circles denote checked
documents. Black and gray boxes are documents in the margin area. The documents
represented by black boxes are shown to the user for the next iteration; these documents
are near the optimal hyperplane
4 Experiments
4.1 Experimental Setting
In reference [10], we have already shown that the utility of our interactive
document retrieval with active learning of an SVM is better than that of Rocchio-based
interactive document retrieval [6], which is the conventional one. This paper presents
experiments comparing the utility for document retrieval among
several VSM representations, and the effectiveness for learning performance
among the several selection rules, which choose the documents displayed for the user
to judge as relevant or irrelevant. The document data set
we used is a set of articles from the Los Angeles Times, which is widely used in the
document retrieval conference TREC [2]. The data set has about 130 thousand
articles, and the average number of words in an article is 526. This data set includes
not only queries but also the relevant documents for each query. Thus we used
these queries for the experiments.
Fig. 7. The retrieval effectiveness of SVM-based feedback (using selection rule 1) for
the boolean, TF, and TFIDF representations: the lines show recall-precision
performance curves using twenty feedback documents on the set of articles from the
Los Angeles Times after 3 feedback iterations. The wide solid line is the boolean
representation, the broken line is the TFIDF representation, and the thin solid line is
the TF representation
Fig. 8. The retrieval effectiveness of SVM-based feedback for selection rules 1 and 2:
the lines show recall-precision performance curves using twenty feedback documents
on the set of articles from the Los Angeles Times after 3 feedback iterations. The thin
solid line is selection rule 1 ("proposed"), and the thick solid line is selection rule 2
("hyperplane")
Table 2. Average total number of relevant documents for the selection rule 1 and 2
using the boolean representation
iterations. In this table, iteration number 0 denotes the retrieval result based
on the cosine distance between the query vector and the document vectors, which are
represented by TFIDF. We can see from Table 1 that the average number
of relevant documents among the twenty displayed documents for selection rule
1 is higher than that for selection rule 2 at each iteration. After all, when
selection rule 2 is adopted, the user has to see a lot of irrelevant documents
at each iteration. Selection rule 1 is effective for immediately ranking highly
those special documents which relate to the user's interest. When
selection rule 1 is adopted, the user does not need to see a lot of irrelevant
documents at each iteration. However, it is hard for rule 1 to immediately
rank highly all documents which relate to the user's interest. In
document retrieval, a user does not want to get all documents related
to the user's interest; the user wants to get some documents related to
the user's interest as soon as possible. Therefore, we conclude that the behaviour
of selection rule 1 is better than that of selection rule 2 for relevance
feedback document retrieval.
Table 2 gives the average total number of relevant documents for selec-
tion rules 1 and 2 at each iteration. In this table, iteration number 0 denotes the
retrieval result based on the cosine distance between the query vector and the document
vectors, which are represented by TFIDF. In Table 2, the average number of
relevant documents among the twenty displayed documents decreases at iteration
number 3. We can see from Table 2 that almost all relevant documents can be found
by iteration number 3. This means that there are few relevant doc-
uments among the remaining documents and it is difficult to find new relevant documents
there. Therefore, the average number of relevant documents among
the twenty displayed documents decreases as the number of
iterations increases.
5 Conclusion
In this paper, we adopted several representations of the Vector Space Model and
several rules for selecting the documents displayed at each iteration, and then showed
comparison results of the retrieval effectiveness in these
several situations.
In our experiments, with our proposed SVM-based relevance feed-
back document retrieval, the binary representation and selection rule 1,
in which the documents that are discriminated as relevant and lie in the margin area of
the SVM are displayed to the user, showed the better document retrieval performance.
In future work, we plan to analyze our experimental results theoretically.
References
1. Yates, R.B., Neto, B.R.: Modern Information Retrieval. Addison Wesley (1999)
2. TREC: (http://trec.nist.gov/)
3. IREX: (http://cs.nyu.edu/cs/projects/proteus/irex/)
4. NTCIR: (http://www.rd.nacsis.ac.jp/˜ntcadm/)
5. Salton, G., McGill, J.: Introduction to modern information retrieval. McGraw-Hill
(1983)
6. Salton, G., ed. In: Relevance feedback in information retrieval. Englewood Cliffs,
N.J.: Prentice Hall (1971) 313–323
7. Okabe, M., Yamada, S.: Interactive document retrieval with relational learning. In:
Proceedings of the 16th ACM Symposium on Applied Computing. (2001) 27–31
8. Tong, S., Koller, D.: Support vector machine active learning with applications to
text classification. In: Journal of Machine Learning Research. Volume 2. (2001)
45–66
9. Drucker, H., Shahrary, B., Gibbon, D.C.: Relevance feedback using support vector
machines. In: Proceedings of the Eighteenth International Conference on Machine
Learning. (2001) 122–129
10. Onoda, T., Murata, H., Yamada, S.: Interactive document retrieval with active
learning. In: International Workshop on Active Mining (AM-2002), Maebashi,
Japan (2002) 126–131
11. Vapnik, V.: The Nature of Statistical Learning Theory. Springer (1995)
12. Bishop, C.: Neural Networks for Pattern Recognition. Clarendon Press, Oxford
(1995)
13. Murata, N., Yoshizawa, S., Amari, S.: Network information criterion - determining
the number of hidden units for an artificial neural network model. IEEE Transac-
tions on Neural Networks 5 (1994) 865–872
14. Onoda, T.: Neural network information criterion for the optimal number of hidden
units. In: Proc. ICNN’95. (1995) 275–280
15. Orr, J., Müller, K.R., eds.: Neural Networks: Tricks of the Trade. LNCS 1524,
Springer Verlag (1998)
16. Boser, B., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classi-
fiers. In Haussler, D., ed.: 5th Annual ACM Workshop on COLT, Pittsburgh, PA,
ACM Press (1992) 144–152
17. Schölkopf, B., Smola, A., Williamson, R., Bartlett, P.: New support vector algo-
rithms. Neural Computation 12 (2000) 1083–1121
18. Schapire, R., Singer, Y., Singhal, A.: Boosting and Rocchio applied to text filtering.
In: Proceedings of the Twenty-First Annual International ACM SIGIR. (1998) 215–
223
19. Kernel-Machines: (http://www.kernel-machines.org/)
Micro View and Macro View Approaches
to Discovered Rule Filtering
1 Introduction
Active mining [1] is a new approach to data mining which tries to discover "high
quality" knowledge that meets users' demands in an efficient manner by integrating
information gathering, data mining, and user reaction technologies. This paper discusses
the discovered rule filtering method [3,4], which filters rules obtained by a data mining
system based on documents retrieved from an information source on the Internet.
Data mining is an automated method to discover knowledge useful to users by
analyzing a large volume of data mechanically. Generally speaking, conventional
methods try to discover significant relations among attributes in the statistical sense
from a large number of attributes contained in a given database, but if we pay
attention only to statistically significant features, we often discover rules that are
already known to the user. To cope with this problem, we are developing a discovered
rule filtering method that filters the large number of rules discovered by a data mining
system down to ones that are novel to the user. To judge whether a rule is novel or not, we
utilize information sources on the Internet and try to judge the novelty of a rule according to
the results of retrieving documents related to the discovered rule.
In this paper, we first discuss the principle of integrating data mining and informa-
tion retrieval in Section 2, and we show the concept and the process of discovered rule filtering
using an example of clinical data mining in Section 3. We then show two approaches
to discovered rule filtering, the micro view approach and the macro view ap-
proach, in Section 4. In Section 5, we show an evaluation of the macro view ap-
proach. Finally, we conclude this paper with our future work in Section 6.
[Fig. 2 data: "hepatitis"+"gpt" (4745 hits); "hepatitis"+"total cholesterol" (48); "hepatitis"+"gpt"+"total cholesterol"+"alb" (0); "hepatitis"+"gpt"+"total cholesterol"+"albumin" (6)]
redundant keyword submissions. The graph in Fig. 2 shows pairs of submitted key-
words and the number of hits. For example, this graph shows that a submission includ-
ing the keywords "hepatitis," "gpt," and "t-cho" returns nothing. It also shows that the
combination of "hepatitis" and "gpt" is better than the combination of "hepatitis" and
"total cholesterol" because the former is expected to have more returns than the latter.
to analyze the abstract of the document and automatically find keywords that show
the user's interest, and uses them for further document retrieval.
In the micro view approach, we retrieve documents related to a discovered rule and
show them directly to the user.
By using the micro view approach, the user can obtain not only novel rules discov-
ered by a data mining system, but also documents related to the rules. By showing a
rule and documents related to the rule at once, the user can get more insight into the
rule and may have a chance to start a new data mining task. In our preliminary ex-
periment, we first showed a discovered rule alone, shown in Figure 1, to a medical
doctor and received the following comment (Comment 1). The doctor seems to take
the discovered rule as a commonly known fact.
Comment 1: "TTT is an indicator of the activity of antibodies. The more ac-
tive the antibodies are, the less active the hepatitis is and therefore the amount of
GPT decreases. This rule can be interpreted by using well-known facts."
We then retrieved related documents by using the rule filtering technique. The
search with the keywords "hepatitis" and "TTT" returned 11 documents. Among them,
there was a document, shown in Fig. 3, in which the doctor showed his interest, as men-
tioned in the following comment (Comment 2).
Comment 2: "This document discusses that we can compare type B virus with
type C virus by measuring the TTT value of hepatitis virus carriers (who have not
contracted hepatitis). It is a new paper published in 2001 that discusses a relation
between TTT and hepatitis, but it reports only a small number of cases. The dis-
covered rule suggests the same symptom appears not only in carriers but also in
patients. This rule is important to support this paper from the standpoint of clinical
data."
The effect shown in this preliminary examination is that the system can retrieve not
only a new document related to a discovered rule but also a new viewpoint on the rule,
and gives a chance to invoke a new mining process. In other words, if the rule alone is
shown to the user, it is recognized just as a common fact, but if it is shown with a
related document, it can motivate the user to analyze the amount of TTT depending
on the type of hepatitis by using a large volume of hepatitis data. We hope this kind of
effect can be found in many other cases.
To see how the micro view approach works, we performed a preliminary experi-
ment of discovered rule filtering. We used 20 rules obtained from the team at Shizu-
oka University and gathered documents related to the rules from the MEDLINE data-
base. The result is shown in Table 1.
In this table, "ID" is the ID number of a rule and "Keywords" are extracted from the
rule and submitted to PubMed. "No" shows the number of submitted key-
words. "Hits" is the number of documents returned. "Ev" is the evaluation of the rule by a
medical doctor. He evaluated each rule, which was given in the form depicted in Fig. 1,
and categorized it into one of 2 classes: R (reasonable rules) and U (unreasonable rules).
This result tells us that it is not easy to distinguish reasonable or known rules from
unreasonable or garbage ones by using only the number of hits. It shows a limitation
of the micro view approach.
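A minimal sketch of this kind of hit counting is given below. It assumes the public NCBI E-utilities esearch interface to PubMed/MEDLINE; the exact retrieval setup of the experiment is not specified in this paper, so the endpoint usage and the example keywords are illustrative assumptions only.

import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

def pubmed_hits(keywords):
    """Return the number of PubMed/MEDLINE documents matching all keywords (AND query)."""
    term = " AND ".join(keywords)
    url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
           + urllib.parse.urlencode({"db": "pubmed", "term": term, "retmax": 0}))
    with urllib.request.urlopen(url) as resp:
        return int(ET.parse(resp).findtext("Count"))

# Keywords extracted from a discovered rule (hypothetical example).
print(pubmed_hits(["hepatitis", "TTT"]))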
To cope with the problem, we need to improve the performance of the micro view ap-
proach as follows.
As we can see, except for Rule 13, rules with more than 0 hits are categorized as rea-
sonable rules, but a number of reasonable rules hit no document. It seems that the
number of submitted keywords affects the number of hits: in other words, if a rule is
complex with many keywords, the number of hits tends to be small.
Rules with only one cluster are regarded as known rules, because a large number of
papers concerning every pair of keywords in the rule have been published. Rules with
two clusters are regarded as unknown rules. This is because research activities con-
cerning the keywords within each cluster have been carried out extensively, but those crossing
the clusters have not. Rules with more than two clusters are regarded as garbage
rules: such a rule is too complex to understand because the keywords are partitioned into
many clusters and the rule consists of many unknown factors.
For example, if we set the threshold of CLINK to 1 (i.e., the frequency of
co-occurrence is 1), the rule in Fig. 4 is regarded as a known rule because all the
keywords are merged into a single cluster. The keywords in Fig. 5 are merged into two
clusters: one cluster consists of GPT and T-CHO and the other consists of chyle only.
Hence, the rule is judged to be unknown. The keywords in Fig. 6 are merged into 3
Fig. 4. The keyword co-occurrence graph of rule including GPT, ABL, and T-CHO
Fig. 5. The keyword co-occurrence graph of rule including GPT, T-CHO, and chyle
Fig. 6. The keyword co-occurrence graph of rule including GPT, G-GTP, hemolysis, female
and “blood group a”
clusters: GPT, G-GTP, and female form one cluster, and each of hemolysis and “blood
group a” forms an individual cluster.
In conclusion, the graph shape of reasonable rules looks different from that of
unreasonable rules. However, how to judge, given a graph, whether the rule is reason-
able or not is left as future work.
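The cluster counting behind this judgement can be sketched as follows. The co-occurrence counts are supplied as a hypothetical dictionary (in practice they would come from keyword-pair searches such as the PubMed queries above), and the merge is implemented here as simple threshold-based connected components, which is one possible reading of the CLINK-thresholded clustering described in the text.

from itertools import combinations

def count_clusters(keywords, cooccurrence, threshold=1):
    """Merge keywords whose pairwise co-occurrence count reaches the threshold
    and return the number of resulting clusters (connected components)."""
    parent = {k: k for k in keywords}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]   # path halving
            k = parent[k]
        return k

    for a, b in combinations(keywords, 2):
        if cooccurrence.get(frozenset((a, b)), 0) >= threshold:
            parent[find(a)] = find(b)
    return len({find(k) for k in keywords})

# Hypothetical co-occurrence counts for the keywords of the rule in Fig. 5.
counts = {frozenset(("gpt", "t-cho")): 120, frozenset(("gpt", "chyle")): 0,
          frozenset(("t-cho", "chyle")): 0}
print(count_clusters(["gpt", "t-cho", "chyle"], counts))   # -> 2, i.e. an "unknown" rule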
Fig. 8. The yearly trend of the Jaccard co-efficient concerning “hcv” and “hepatitis”
Fig. 9. The yearly trend of the Jaccard co-efficient concerning “smallpox” and “vaccine”
To show feasibility in the hepatitis data mining domain, we measured the yearly
trend of the Jaccard co-efficient between “hepatitis” and each of five representative hepatitis
viruses (hav, hbv, hcv, hdv, and hev), and show the results in Fig. 12. We also show
the history of hepatitis virus discovery in Table 2. There is apparently a correlation
between the Jaccard co-efficient and the discovery time of the hepatitis viruses. In
hepatitis research activities, works on hbv and hcv are major, and especially those on
hcv increase rapidly after its discovery.
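The yearly trend curves are based on the Jaccard coefficient of the two keywords' document sets. Assuming the standard definition computed from hit counts, a minimal sketch is given below; the per-year counts in the example are hypothetical.

def jaccard(hits_a, hits_b, hits_ab):
    """Standard Jaccard coefficient J = |A ∩ B| / |A ∪ B| from document hit counts."""
    union = hits_a + hits_b - hits_ab
    return hits_ab / union if union else 0.0

# Yearly trend for one keyword pair, from per-year hit counts (hypothetical numbers):
# year -> (hits for keyword a, hits for keyword b, hits for "a AND b").
yearly_counts = {1999: (5200, 8100, 900), 2000: (5600, 8600, 1150), 2001: (6000, 9100, 1400)}
trend = {year: jaccard(a, b, ab) for year, (a, b, ab) in yearly_counts.items()}
print(trend)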
Fig. 10. The yearly trend of the Jaccard co-efficient concerning “gpt” and “got”
Fig. 11. The yearly trend of the Jaccard co-efficient concerning “albumin” and “urea nitrogen”
Fig. 12. The yearly trend of the Jaccard co-efficient concerning hepatitis viruses
From the above results, the yearly trends correspond well with historical events in the
medical domain and can be a measure for knowing the research activity.
medical doctors, so we suppose they are knowledgeable about the medical knowledge
in textbooks.
Q: How do you guess the result when you submit the following keywords to the
PubMed system? Choose one among A, B, and C.
Fig. 14. The relation between the number of clusters and the evaluation of medical experts
We verify the hypotheses of the macro view method by using the results of the ques-
tionnaire. We show the relation between the number of clusters and the average ratio
of choice in Fig. 14. The threshold of CLINK is 1. At the 5% significance level, the graph
shows two significant relations:
• As the number of clusters increases, the average ratio of “unknown” in-
creases.
• As the number of clusters increases, the average ratio of “known” decreases.
The result does not show any significant relation for the “garbage” choice, because
the number of students who chose “garbage” is relatively small compared to the other choices
and does not depend on the number of clusters. We suppose the medical students
hesitated to judge that a rule is just garbage.
The hypotheses of the macro view approach are thus partly supported by this evalua-
tion. The maximum number of clusters in this examination is 3; we still need to ex-
amine how medical experts judge rules with 4 or more clusters.
6 Summary
We discussed a discovered rule filtering method which filters the rules discovered by a
data mining system down to novel ones by using information retrieval techniques. We proposed two ap-
proaches to discovered rule filtering, the micro view approach and the macro
view approach, and showed the merits and demerits of the micro view approach and the feasibil-
ity of the macro view approach.
Our future work is summarized as follows.
• We need to find a clear measure to distinguish reasonable rules from unrea-
sonable ones, which can be used in the macro view method. We also need to
find a measure of the novelty of a rule.
• We need to improve the performance of the micro view approach by adding key-
words that represent relations among attributes and by using natural language
processing techniques. The improvement of the micro view approach can con-
tribute to the improvement of the macro view approach.
• We need to implement the macro view method in a discovered rule filtering
system and apply it to an application of hepatitis data mining.
Acknowledgement
This work is supported by a Grant-in-Aid for Scientific Research on Priority Areas from the
Japanese Ministry of Education, Culture, Sports, Science and Technology.
References
1. H. Motoda (Ed.), Active Mining: New Directions of Data Mining, IOS Press, Amsterdam,
2002.
2. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley,
1999.
3. Y. Kitamura, K. Park, A. Iida, and S. Tatsumi. Discovered Rule Filtering Using Information
Retrieval Technique. Proceedings of International Workshop on Active Mining, pp. 80-84,
2002.
4. Y. Kitamura, A. Iida, K. Park, and S. Tatsumi, Discovered Rule Filtering System Using
MEDLINE Information Retrieval, JSAI Technical Report, SIG-A2-KBS60/FAI52-J11,
2003.
1 Introduction
Inductive learning of first-order theories from examples is interesting because the first-
order representation provides comprehensibility of the learning results and the ca-
pability to handle more complex data consisting of relations. Yet, the bottleneck
we apply the proposed system to learn hypotheses from this kind of data. We
compare the learning results with previous approaches in order to evaluate
the performance.
This paper is mainly divided into two parts. We first introduce multiple-
part data and describe the proposed approach. Then, we conduct exper-
iments for SAR studies on two chemical compound datasets. The experimental
results are compared with existing approaches to evaluate the performance.
Finally, we conclude the paper and consider our future work.
2 Background
2.1 FOIL
FOIL [1] is a top-down ILP system which learns function-free Horn clause definitions of a target predicate using background predicates. The learning process in FOIL starts with training examples containing all positive and negative examples. The algorithm used in FOIL for constructing a function-free Horn clause consists of two main loops. In the outer loop, a Horn clause partially covering the examples is constructed, and the covered examples are removed from the training set. In the inner loop, partially developed clauses are iteratively refined by adding literals one by one. A heuristic function is used to select the most appropriate literal. FOIL maintains the covered examples in the form of tuples, each of which is a substitution (i.e., a binding of variables) of the clause under a given example. Multiple tuples can be generated from one example.
FOIL uses a heuristic function based on information theory for assessing the usefulness of a literal. It provides effective guidance for clause construction. The purpose of this heuristic function is to characterize a subset of the positive examples.
From the partially developed clause below

R(V_1, V_2, \ldots, V_k) \leftarrow L_1, L_2, \ldots, L_{m-1}

the training tuples covered by this clause are denoted as T_i. The information required for T_i is calculated from T_i^+ and T_i^-, which denote the positive and negative tuples covered by the clause, respectively:

I(T_i) = -\log_2 \frac{|T_i^+|}{|T_i^+| + |T_i^-|}    (1)
If a literal L_m is selected and added, a new set of covered tuples T_{i+1} is created, and a similar formula is given as

I(T_{i+1}) = -\log_2 \frac{|T_{i+1}^+|}{|T_{i+1}^+| + |T_{i+1}^-|}    (2)
From the above, the heuristic used in FOIL is calculated as the amount of information gained when the new literal L_m is applied:

Gain(L_m) = |T_i^{++}| \times (I(T_i) - I(T_{i+1}))    (3)
T_i^{++} is the set of positive tuples that are included in T_i and extended in T_{i+1}. This heuristic function is evaluated over all candidate literals, and the literal with the largest value is selected and added to the partially developed clause in the inner loop.
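To make equations (1)-(3) concrete, the following Python sketch computes the information-based gain of a candidate literal from tuple counts. The function and variable names are ours, not FOIL's.

import math

def info(pos: int, neg: int) -> float:
    """Information required for a tuple set T, eqs. (1)/(2): -log2(|T+| / (|T+| + |T-|))."""
    return -math.log2(pos / (pos + neg))

def foil_gain(pos_i: int, neg_i: int, pos_i1: int, neg_i1: int, pos_kept: int) -> float:
    """Eq. (3): gain of adding a literal, weighted by the positive tuples of T_i
    that are still represented in T_{i+1} (|T_i^{++}| = pos_kept)."""
    return pos_kept * (info(pos_i, neg_i) - info(pos_i1, neg_i1))

# Example: 10 positive / 10 negative tuples before adding the literal,
# 8 positive / 2 negative after; 8 of the original positive tuples survive.
print(foil_gain(10, 10, 8, 2, 8))   # about 5.42 bits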
[Figure: positive (+) and negative (-) instances grouped into bags around the target concept in multiple-instance learning]
negative instance. Assuming that the target concept is a single point t, the Diverse Density is defined as the probability that the point t is the target concept given the positive and negative bags. Using Bayes' rule and assuming a uniform prior over the concept location, this is equivalent to

DD(x) = \prod_i \Pr(x = t \mid B_i^+) \prod_i \Pr(x = t \mid B_i^-)    (4)

where x is a point in the feature space and B_{ij} represents the j-th instance of the i-th bag in the training examples. For the distance, the Euclidean distance is adopted:

\|B_{ij} - x\|^2 = \sum_k (B_{ijk} - x_k)^2    (5)
In the previous approaches, several search techniques were proposed for determining the feature values or the area in the feature space that maximizes DD.
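The following Python sketch illustrates how DD can be evaluated at a candidate point. The paper does not spell out the probability model behind equation (4), so the sketch assumes the common noisy-or estimator of Maron and Lozano-Pérez [6]; all names and the toy data are ours.

import math
from typing import Sequence

Point = Sequence[float]

def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance, eq. (5)."""
    return sum((ak - bk) ** 2 for ak, bk in zip(a, b))

def diverse_density(x: Point, pos_bags: list, neg_bags: list) -> float:
    """DD(x) under the noisy-or model (an assumption; eq. (4) only states the product form):
    Pr(x=t|B_i^+) = 1 - prod_j (1 - exp(-||B_ij - x||^2)),
    Pr(x=t|B_i^-) = prod_j (1 - exp(-||B_ij - x||^2))."""
    dd = 1.0
    for bag in pos_bags:
        dd *= 1.0 - math.prod(1.0 - math.exp(-sq_dist(inst, x)) for inst in bag)
    for bag in neg_bags:
        dd *= math.prod(1.0 - math.exp(-sq_dist(inst, x)) for inst in bag)
    return dd

# Toy example: the point (1, 1) appears in both positive bags but not in the negative bag.
pos = [[(0.0, 3.0), (1.0, 1.0)], [(1.0, 1.1), (4.0, 4.0)]]
neg = [[(3.0, 0.0), (5.0, 5.0)]]
print(diverse_density((1.0, 1.0), pos, neg) > diverse_density((4.0, 4.0), pos, neg))  # True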
3 Proposed Method
We present a top-down ILP system that can learn hypotheses more efficiently from a set of examples, each consisting of several small parts, or when trying to predict the class of data from a common substructure. The proposed system incorporates an existing top-down ILP system (FOIL) and applies a multiple-instance-based measure to find common characteristics among the parts of positive examples. This measure is then used as a weight attached to each part of an example, so that the parts common among positive examples receive high-valued weights. With these weights and a heuristic function based on example coverage, the system generates more precise and higher-coverage hypotheses from the training examples.
Next, we define multiple-part data, and then explain the modified heuristics.
parameters denote the identification of the data and the part. The remaining parameters are used for attributes. To denote a relation between parts, we use one predicate per relation, in a similar manner to a part. The predicate is written as
From equation 1, T_i^+ and T_i^- denote the sets of positive and negative tuples, respectively. The DD value can be used to show the importance of a part of the data by representing each instance of multiple-part data as a bag and each part as an instance in the bag. The distance between two parts is calculated by first constructing a vector p for each part, where p_i denotes Attr-i in the part predicate explained in the previous section. Then, equation 5 is used to calculate the distance between two attribute vectors. In Fig. 4, the distance between atom(c1,a1,c,-1.2) and atom(c1,a2,o,1.3) is computed by constructing the two vectors [c, -1.2] and [o, 1.3] and using equation 5. To compute the DD value of each atom, the distances between all atom pairs are calculated first. Then, x in equation 4 is assigned to the vector of the atom being considered, and \|B_{ij}^+ - x\|^2 and \|B_{ij}^- - x\|^2 are obtained from the computed distances.
[Fig. 4: example chemical compounds represented as multiple-part data (atoms a1, ..., a6)]
I(T_i) = -\log_2 \frac{DD_s(T_i^+)}{DD_s(T_i^+) + |T_i^-|}    (8)

Gain(L_i) = DD_s(T_i^{++}) \times (I(T_i) - I(T_{i+1}))    (9)
This function weights each part with its DD value and uses the sum of these weights to select the literal, while the original heuristic function weights all parts equally with the value 1. Nevertheless, we still use the number of negative tuples |T_i^-| in the same way as the original heuristic, because we know that all parts of negative examples show the same strength; therefore, every negative part is weighted with the value 1.
With the above function, one problem remains to be considered. Each tuple may consist of more than one part; the algorithm has to calculate the DD value of a relation among parts, e.g., a bond makes each tuple contain two atoms. We then have to select a weight that represents each tuple from the DD values of its parts. We solve this problem by simply using the average DD value over the parts in the tuple as the weight of the tuple (equation 6).
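A minimal Python sketch of the modified heuristic follows, assuming that each tuple carries the DD values of its parts, that a tuple's weight is their average (equation 6 as described above), and that DDs(T) sums these weights over a tuple set; all names are ours.

import math

def tuple_weight(part_dds: list) -> float:
    """Weight of a tuple: the average DD value of its parts (equation 6 as described in the text)."""
    return sum(part_dds) / len(part_dds)

def dd_sum(tuples: list) -> float:
    """DDs(T): sum of the tuple weights over a tuple set T."""
    return sum(tuple_weight(t) for t in tuples)

def weighted_info(pos_tuples: list, n_neg: int) -> float:
    """Equation (8): I(T_i) = -log2( DDs(T_i^+) / (DDs(T_i^+) + |T_i^-|) )."""
    dds = dd_sum(pos_tuples)
    return -math.log2(dds / (dds + n_neg))

def weighted_gain(pos_i, n_neg_i, pos_i1, n_neg_i1, pos_kept) -> float:
    """Equation (9): Gain(L) = DDs(T_i^{++}) * (I(T_i) - I(T_{i+1}))."""
    return dd_sum(pos_kept) * (weighted_info(pos_i, n_neg_i) - weighted_info(pos_i1, n_neg_i1))

# Toy example: the positive tuples whose parts have high DD survive the new literal,
# while most negative tuples are removed.
pos_before = [[0.9, 0.8], [0.7, 0.9], [0.2, 0.1]]
pos_after = [[0.9, 0.8], [0.7, 0.9]]
print(weighted_gain(pos_before, 10, pos_after, 1, pos_after))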
3.3 Algorithm
Based on this modified function, we implemented a prototype system called FOILmp (FOIL for Multiple-Part data). This system basically uses the same algorithm as proposed in [1]. Nevertheless, in order to construct accurate hypotheses, beam search is applied so that the algorithm maintains a set of good candidates instead of selecting only the best candidate at each step. This search method enables the algorithm to backtrack in the right direction and finally reach the goal. Moreover, in order to obtain rules with high coverage, we define a coverage ratio, and the algorithm is set to select only rules whose positive example coverage is higher than this ratio. Fig. 5 shows the main algorithm used in the proposed FOILmp
– Theory ← ∅
– Remaining ← Positive(Examples)
– While not StopCriterion(Examples, Remaining)
  • Rule ← FindBestRule(Examples, Remaining)
  • Theory ← Theory ∪ Rule
  • Covered ← Cover(Remaining, Rule)
  • Remaining ← Remaining − Covered
FindBestRule(Examples, Remaining)
Fig. 6. Algorithm for finding the best rule from the remaining positive examples
system. This algorithm starts by initializing the set Theory to the empty set and the set Remaining to the set of positive examples. The algorithm loops to find rules and adds each rule found to Theory until all positive examples are covered. The modified subroutine for selecting rules is shown in Fig. 6. There are two user-defined parameters: ε for the minimum accuracy and γ for the minimum positive example coverage.
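The covering loop of Fig. 5 can be sketched as below; find_best_rule stands in for the beam-searched subroutine of Fig. 6 (constrained by ε and γ) and covers for its coverage test. These are placeholder callables, not the authors' implementation.

def foilmp(examples, positives, find_best_rule, covers, stop_criterion):
    """Cover-set main loop of FOILmp (Fig. 5): learn rules until the remaining
    positive examples satisfy the stopping criterion."""
    theory = []
    remaining = set(positives)
    while not stop_criterion(examples, remaining):
        rule = find_best_rule(examples, remaining)   # beam search, constrained by
        if rule is None:                             # minimum accuracy (epsilon) and
            break                                    # minimum coverage (gamma)
        theory.append(rule)
        covered = {e for e in remaining if covers(rule, e)}
        remaining -= covered
    return theory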
4 Experiments
We conducted experiments on two datasets for SAR: the Mutagenicity data and the Dopamine antagonist data. To evaluate the performance of the proposed system, the experiments were conducted in a ten-fold cross-validation manner, and we compare the results with the existing approaches.
4.1 Dataset
In this research, we aim to discover rules describing the activities of chemical
compounds from their structures. Two kinds of SAR data were studied: the mutagenesis dataset [7] and the dopamine antagonist dataset.
– atm(comp, atom, element, type, charge), stating that the atom atom in the compound comp has element element, type type, and partial charge charge.
– bond(comp, atom1, atom2, type), describing that there is a bond of type
between the atoms atom1 and atom2 in the compound comp.
The background knowledge in this dataset is already formalized in the form
of multiple-part data, and thus, no preprocessing is necessary.
features to characterize each atom. After discussions with the domain expert, other features based on basic knowledge in chemistry were added: the number of bonds linked to an atom, the average length of the bonds linked to an atom, connection to an oxygen atom, and the minimum distances to oxygen and nitrogen.
Most of the features are related to oxygen and nitrogen because the expert said that the positions of oxygen and nitrogen have an effect on the activity of a dopamine antagonist. Hence, the predicate atm is modified to atm(compound, atom, element, number-bond, avg-bond-len, o-connect, o-min-len, n-min-len).
Although the proposed method can handle only two-class data (positive or negative), there are four classes of dopamine antagonist compounds. Hypotheses for each class are therefore learned in the one-against-the-rest
Table 1. Ten-fold cross validation test comparing the accuracy on Mutagenesis data
Approach Accuracy
The proposed method 0.82
PROGOL 0.76
FOIL 0.61
Fig. 8 shows examples of rules obtained by FOILmp. We found that FOILmp obtains rules with high coverage; for example, the first rule covers around 50% of all positive examples.
Table 1 shows the experimental results on the Mutagenicity data. The prediction accuracy on test examples using ten-fold cross validation is compared with the existing approaches (FOIL and PROGOL). It shows that the proposed method can predict more accurately than the existing approaches.
Table 2. Prediction accuracy (%) on the dopamine antagonist data (ten-fold cross validation)

           FOILmp                          Aleph
Activity   Accuracy       Accuracy        Accuracy       Accuracy
           (overall)      (only positive) (overall)      (only positive)
D1         97.0           85.5            96.0*          78.6**
D2         88.1           79.1            86.4*          70.5*
D3         93.4           78.4            93.1           75.1*
D4         88.4           85.1            87.6*          83.2*
Fig. 12. Visualization of a molecule described by Rule 1 from (a) FOILmp and (b) Aleph
this dataset in reasonable time. Aleph is an ILP system based on inverse entailment and a similar algorithm to PROGOL. However, Aleph has adopted several search strategies, such as randomized search, which help improve the performance of the system. In this experiment, we set Aleph to use GSAT [9], one of the randomized search algorithms, with which the best results can be generated.
Fig. 9 shows the performance table comparing the experimental results in the first fold. Table 2 shows the prediction accuracy computed for both positive and negative examples, and then for only the positive examples. The table also shows the results of significance tests using a paired t-test. The experimental results show that FOILmp predicts more accurately than Aleph under both accuracy computation methods. The significance tests also show the confidence level of the difference between the accuracies. Figs. 10 and 11 show the details of the rules obtained by FOILmp and Aleph, respectively. We also found that FOILmp generates rules with higher coverage than Aleph; one such rule covers 36.5% of the positive examples.
5 Related Work
In recent years, much research has been done on learning from chemical compound structure data, because the learning results can be applied directly to producing new drugs for curing some difficult diseases. Muggleton, Srinivasan and King [7, 10, 11] proposed an approach that applies PROGOL to predict several datasets, including the mutagenicity of chemical compounds used in our experiments.
King et al. also discussed whether a propositional learner or ILP is better for learning from chemical structure [10]. Actually, the first-order representation can denote chemical structure without losing any information. Since denoting relational data using propositional logic is beyond its limits, some special techniques are required; e.g., for relations among parts, we may use only the average value of features or use domain-related knowledge to calculate a new feature for categorization. A propositional learner can nevertheless perform better than a learner using the first-order representation, because ILP learners have some restrictions originating from the logic theory. However, comparing only accuracy may not be a good assessment, because chemists naturally think in terms of chemical structure and the learning results from ILP are comprehensible to chemists.
King et al. [10] reviewed four case studies related to SAR: inhibition of dihydrofolate reductase by pyrimidines, inhibition of dihydrofolate reductase by triazines, design of tacrine analogues, and mutagenicity of nitroaromatic and heteroaromatic compounds. The experimental results were compared with two propositional learners: Regression, a linear regression technique, and CART, a decision tree learner. They found that, with these chemical structure data, propositional learners with a limited number of features per instance are sufficient for all the problems. However, when more complex chemical structures and background knowledge are added, propositional representations become unmanageable. Therefore, first-order representations would provide more possibilities together with more comprehensible results.
References
1. Quinlan, J.R.: Learning logical definitions from relations. Machine Learning 5
(1990) 239–266
2. Dietterich, T.G., Lathrop, R.H., Lozano-Perez, T.: Solving the multiple instance
problem with axis-parallel rectangles. Artificial Intelligence 89 (1997) 31–71
3. Wang, J., Zucker, J.D.: Solving the multiple-instance problem: A lazy learning
approach. In: Proc. 17th International Conf. on Machine Learning, Morgan Kauf-
mann, San Francisco, CA (2000) 1119–1125
4. Chevaleyre, Y., Zucker, J.D.: A framework for learning rules from multiple instance
data. In: Proc. 12th European Conf. on Machine Learning. Volume 2167 of LNCS.,
Springer (2001) 49–60
5. Gärtner, T., Flach, P.A., Kowalczyk, A., Smola, A.J.: Multi-instance kernels. In:
Proc. 19th International Conf. on Machine Learning, Morgan Kaufmann (2002)
179–186
6. Maron, O., Lozano-Pérez, T.: A framework for multiple-instance learning. In
Jordan, M.I., Kearns, M.J., Solla, S.A., eds.: Advances in Neural Information Pro-
cessing Systems. Volume 10., The MIT Press (1998)
7. Srinivasan, A., Muggleton, S., King, R., Sternberg, M.: Mutagenesis: ILP exper-
iments in a non-determinate biological domain. In Wrobel, S., ed.: Proc. 4th In-
ternational Workshop on Inductive Logic Programming. Volume 237., Gesellschaft
für Mathematik und Datenverarbeitung MBH (1994) 217–232
8. Srinivasan, A.: The Aleph manual (2001) http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/.
9. Selman, B., Levesque, H.J., Mitchell, D.: A new method for solving hard satisfiabil-
ity problems. In: Proceedings 10th National Conference on Artificial Intelligence.
(1992) 440–446
10. King, R.D., Sternberg, M.J.E., Srinivasan, A.: Relating chemical activity to struc-
ture: An examination of ILP successes. New Generation Computing 13 (1995)
411–433
11. Srinivasan, A., Muggleton, S., Sternberg, M.J.E., King, R.D.: Theories for muta-
genicity: A study in first-order and feature-based induction. Artificial Intelligence
85 (1996) 277–299
12. Weidmann, N., Frank, E., Pfahringer, B.: A two-level learning method for gen-
eralized multi-instance problems. In: Proceedings of the European Conference on
Machine Learning (ECML-2003). (2003) 468–479
13. McGovern, A., Jensen, D.: Identifying predictive structures in relational data using
multiple instance learning. In: Proceedings of the 20th International Conference
on Machine Learning (ICML-2003). (2003)
14. Chevaleyre, Y., Zucker, J.D.: Solving multiple-instance and multiple-part learning
problems with decision trees and decision rules: Application to the mutagenesis
problem. Technical report, LIP6-CNRS, University Paris VI (2000)
15. Zucker, J.D.: Solving multiple-instance and multiple-part learning problems with
decision trees and rule sets. application to the mutagenesis problem. In: Proceed-
ings of Canadian Conference on AI 2001. (2001) 204–214
First-Order Rule Mining by Using Graphs
Created from Temporal Medical Data
1 Introduction
Hospital information systems that store medical data are very popular, especially
in large hospitals. Such systems hold patient medical records, laboratory data,
and other types of information, and the knowledge extracted from such medical
data can assist physicians in formulating treatment strategies. However, the
volume of data is too large to allow efficient manual extraction of data. Therefore,
physicians must rely on computers to extract relevant knowledge.
Medical data have three notable features [14]: the number of records increases each time a patient visits a hospital; values are often missing, usually because patients do not always undergo all examinations; and the data include time-series attributes with irregular time intervals. To handle medical data, a mining system must have functions that accommodate these features. Methods for mining data include K-NN, decision trees, neural nets, association rules, and genetic algorithms [1]. However, these methods are unsuitable for medical data, because such data include multiple relationships and temporal relationships with irregular intervals.
Inductive Logic Programming (ILP) [4] is an effective method for handling multiple relationships, because it uses Horn clauses, which constitute a subset of first-order logic. However, ILP is difficult to apply to large volumes of data because of its computational cost. We propose a new graph-based algorithm for inducing Horn clauses that represent temporal relations from data in the manner of ILP systems. The method can reduce the computational cost of exploring the hypothesis space. We apply this system to a medical data mining task and demonstrate its performance in identifying temporal knowledge in the data.
This paper is organized as follows. Section 2 characterizes the medical data
with some examples. Section 3 describes related work in time-series data and
medical data. Section 4 presents new temporal relationship mining algorithms
and mechanisms. Section 5 applies the algorithms to real-world medical data to
demonstrate our algorithm’s performance, and Section 6 discusses our experi-
mental result and methods. Finally, in Section 7 we present our conclusions.
2 Medical Data
As described above, the sample medical data shown here have three notable
features. Table 1 shows an example laboratory examination data set including
seven attributes. The first attribute, ID, denotes the personal identification. The second is Examination Date, the date on which the patient consulted a physician. The remaining attributes designate the results of laboratory tests.
The first feature shows that the data contain a large number of records.
The volume of data in this table increases quickly, because new records having
numerous attributes are added every time a patient undergoes an examination.
The second feature is that many values are missing from the data. Table 1
shows that many values are absent from the attributes that indicate the results
of laboratory examinations. Since this table is an extract from medical data, the
number of missing values is quite low. However, in the actual data set this number
is far higher. That is, most of the data are missing values, because each patient
undergoes only some tests during the course of one examination. In addition,
Table 1 does not contain data when laboratory tests have not been conducted.
This means that the data during the period 1983/12/13 to 1984/01/22 for patient
ID 14872 can also be considered missing values.
The other notable feature of the medical data is that they contain time-series attributes. If a table did not have these attributes, the data would contain only a relationship between ID and examination results. Under those circumstances, the data could be subjected to decision tree learning or any other propositional learning method. However, relationships between examination dates are also included; that is, the data contain multiple relationships.
3 Related Work
These kinds of data can be handled by any of numerous approaches. We sum-
marize related work for treating such data from two points of view: time-series
data and medical data.
Data 1 2 3 4
ID1 50 100 50 50
ID2 50 50 0 50
ID1’ - 100 50 50
ID1” 50 - 50 50
developed in order to obtain knowledge from such data. The temporal description is usually converted into attribute features by some special function or a dynamic time warping method. Subsequently, the attribute features are used in a standard machine learning method such as decision tree learning [15] or clustering [2]. Since these methods do not treat the data directly, the obtained results can be biased by the summarization of the temporal data.
Another approach uses InfoZoom [13], a tool for the visualization of medical data in which the temporal data are shown to a physician, and the physician tries to find knowledge from the medical data interactively. This tool provides useful support for the physician, but does not induce knowledge by itself.
The arguments denote the patient ID, the kind of laboratory test, the value of the test, the beginning date of the period being considered, and the ending date of the period being considered, respectively. This predicate returns true if all tests conducted within the period have the designated value. For example, the following predicate is true if patient ID 618 had at least one GOT test from Oct. 10th 1982 to Nov. 5th 1983, and all tests during this period yielded very high values.
1. If E+ = ∅, return R
2. Construct clause H by using the internal loop algorithm
3. Let R = R ∪ H
4. Goto 1
H is a hypothesis.
4.4 Refinement
In our method, the search space for the hypothesis is constructed by combinations of the predicates described in Section 4.2. Suppose that the number of kinds of tests is Na, the number of test domains is Nv, and the number of date possibilities is Nd. Then the number of candidate literals is Na × Nv × Nd²/2.
As we described in Section 2, because medical data consist of a great number of records, the computational cost of handling them is also great. However, medical data have many missing values and consequently often consist of sparse data. By making use of this fact, we can reduce the search space and the computational cost.
To create the candidate literals used for refinement, we propose employing graphs created from the temporal medical data. The purpose of this literal creation is to find literals that cover many positive examples. In order to find them, a graph is created from the positive examples. The nodes of the graph are defined by the medical data records, and each node has four labels, i.e., patient ID, laboratory test name, laboratory test value, and the date the test was conducted. Arcs are then created between nodes. Suppose that two nodes n(Id0, Att0, Val0, Dat0) and n(Id1, Att1, Val1, Dat1) exist. An arc is created between them if all the following conditions hold:
[Figs. 2 and 3: graphs built from the nodes n(23,got,vh,80), n(31,got,vh,72), n(31,got,vh,84), and n(35,got,vh,74), before and after the arc deletion]
– Id0 ≠ Id1
– Att0 = Att1
– Val0 = Val1
– For all D such that Dat0 ≤ D ≤ Dat1:
  if n(Id0, Att0, Val, D) exists, then Val = Val0, and
  if n(Id1, Att1, Val, D) exists, then Val = Val1
For example, supposing that we have data shown in Table 5, we can obtain
the graph shown in Figure 2.
After constructing the graph, the arcs are deleted by the following reduction
rules:
After deleting all arcs for which the above conditions hold, we obtain the maximum periods which contain positive examples. Then we pick up the remaining arcs and set the node dates as BeginningDate and EndingDate. After applying the deletion rules to the graph shown in Figure 2, we obtain the graph shown in Figure 3. The final literal candidates for refinement are then blood_test(Id, got, veryhigh, 72, 80) and blood_test(Id, got, veryhigh, 74, 84), which in this case cover all three patients.
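The arc-creation conditions can be sketched as follows. Nodes are (Id, Att, Val, Date) tuples taken from the positive examples, and each allowed arc yields a candidate period for a blood_test literal; the arc-deletion (reduction) rules are not reproduced in this text, so the sketch only enumerates the arcs permitted by the conditions. All function names are ours.

from itertools import combinations

Node = tuple[int, str, str, int]   # (patient Id, test Att, discretized Val, Date)

def consistent(node: Node, lo: int, hi: int, records: list) -> bool:
    """All tests of the same kind for this patient inside [lo, hi] must show the same value."""
    pid, att, val, _ = node
    return all(v == val for (i, a, v, d) in records
               if i == pid and a == att and lo <= d <= hi)

def candidate_arcs(records: list) -> list:
    """Arcs between nodes of different patients with the same test and value,
    whose observations are consistent over the spanned period (conditions above)."""
    arcs = []
    for n0, n1 in combinations(records, 2):
        (id0, att0, val0, d0), (id1, att1, val1, d1) = n0, n1
        lo, hi = min(d0, d1), max(d0, d1)
        if (id0 != id1 and att0 == att1 and val0 == val1
                and consistent(n0, lo, hi, records) and consistent(n1, lo, hi, records)):
            arcs.append((lo, hi))   # candidate blood_test(Id, att0, val0, lo, hi)
    return arcs

# The example nodes of Fig. 2 (GOT, very high) reduced to day offsets:
data = [(23, "got", "vh", 80), (31, "got", "vh", 72), (31, "got", "vh", 84), (35, "got", "vh", 74)]
print(sorted(set(candidate_arcs(data))))   # candidate periods before the (omitted) reduction rules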
5 Experiment
5.1 Experimental Settings
In order to evaluate the proposed algorithm, we conducted experiments on real
medical data donated by Chiba University Hospital. These medical data contain data on hepatitis patients, and the physicians asked us to find an effective timing for starting interferon therapy. Interferon is a medicine for reducing the hepatitis virus. It has a great effect for some patients; however, some patients show no effect, and the condition of some patients even deteriorates. Furthermore, the medicine is expensive and has side effects. Therefore, physicians want to know the effectiveness of interferon before starting the therapy. According to our consulting physician, the effectiveness of the therapy could change with the patient's condition and could be affected by the timing of starting it.
We input the data of patients whose response was complete as positive examples, and the data of the remaining patients as negative examples. A complete response was judged by virus tests and under the advice of our consulting physician. The numbers of positive and negative examples are 57 and 86, respectively. GOT, GPT, TTT, ZTT, T-BIL, ALB, CHE, TP, T-CHO, WBC, and PLT, which are attributes of the blood test, were used in this experiment. Each attribute value was discretized by the criteria suggested by the physician. We treat the starting date of interferon therapy as the base date, in order to align data from different patients. According to the physician, small changes in blood test results can be ignored. Therefore, we consider the predicate blood_test to be true if a percentage p, which is set by a parameter, of the blood tests in the time period show the specified value.
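A sketch of this relaxed blood_test predicate is given below; the record layout, the default value of p, and all names are our assumptions based on the description above.

def blood_test(records, patient_id, attribute, value, begin, end, p=0.8):
    """True if at least a fraction p of patient_id's `attribute` tests whose dates fall
    in [begin, end] (days relative to the interferon base date) show `value`.
    `records` is an iterable of (patient_id, attribute, discretized_value, date) tuples."""
    in_period = [v for (pid, att, v, d) in records
                 if pid == patient_id and att == attribute and begin <= d <= end]
    if not in_period:
        return False
    return sum(v == value for v in in_period) / len(in_period) >= p

# Example in the spirit of rule (1): WBC stays "low" between 149 and 210 days before therapy.
recs = [(8, "wbc", "low", 150), (8, "wbc", "low", 180), (8, "wbc", "normal", 300)]
print(blood_test(recs, 8, "wbc", "low", 149, 210, p=1.0))   # True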
5.2 Result
Since not all results can be explained, we introduce only five of the rules obtained
by our system.
inf_effect(Id):-
blood_test(Id,wbc,low,149,210). (1)
[Fig. 4. WBC/-149--210/low[3:4]/1.0: WBC Value plotted against Date for the patients for whom interferon therapy was effective (patient IDs 8, 37, 50, 226, 275, 307, 347, 704, 913, 923, 937, 952)]
[Fig. 5. TP/-105--219/high[8.2:9.2]/1.0: TP Value plotted against Date for the effective patients (IDs 667, 671, 750, 774, 898, 945)]
test data for the effective patients are shown in Figure 4. The number of each graph line represents the patient ID, and the title of the graph represents test-name/period/value[low:high]/parameter p.
inf_effect(Id):-
blood_test(Id,tp,high,105,219). (2)
interferon therapy was effective for 6 of these. The rule held for 10.5
percent of the effective patients and had 85.7 percent accuracy. The
blood test data for effective patients are shown in Figure 5.
inf_effect(Id):-
blood_test(Id,wbc,normal,85,133),
blood_test(Id,tbil,veryhigh,43,98). (3)
[Fig. 6. WBC/-85--133/normal[4:9]/0.8 and T-BIL/-43--98/veryhigh[1.1:1.5]/0.8: WBC and T-BIL values plotted against Date for the effective patients (IDs 347, 404, 592, 737, 760)]
test data for the effective patients are shown in Figure 6. The patients who satisfy both tests in rule (3) are positive patients. This means that we have to view both graphs in Figure 6 simultaneously.
inf_effect(Id):-
blood_test(Id,tbil,medium,399,499). (4)
[Fig. 7. T-BIL/-399--499/medium[0.6:0.8]/0.8: T-BIL Value plotted against Date for the effective patients (IDs 78, 730, 774, 944, 953)]
[Figure: GOT/-30--57/ultrahigh[200:3000]/0.8, GOT Value plotted against Date for the effective patients (IDs 547, 671, 690, 894, 945)]
of the effective patients and had 83.3 percent accuracy. The blood test
data for effective patients are shown in Figure 7.
inf_effect(Id):-
blood_test(Id,got,ultrahigh,30,57). (5)
6 Discussion
The results of our experiment demonstrate that our method successfully induces
rules with temporal relationships in positive examples. For example, in Figure 4,
the value range of WBC for the patients is wide except for the period between
149 and 210, but during that period, patients exhibiting interferon effect have
the same value. This implies that our system can discover temporal knowledge
within the positive examples.
We showed these results to our consulting physician. He stated that if a rule specified a period about half a year before the therapy starting day, the causality between the phenomenon and the result would be hard to imagine; it would be necessary to find a connection between them during that period. In relation to the third rule, he also commented that the hypothesis implies that a temporary deterioration in a patient's condition would indicate the desirability of starting interferon therapy, with a complete response expected.
The current system utilizes a cover set algorithm to induce knowledge. This method starts by finding the largest group in the positive examples and then progresses to find smaller groups. According to our consulting physician, the patients could be divided into groups, even within the interferon-effective patients. One method for identifying such groups is the subgroup discovery method [5], which uses a covering algorithm involving example weighting for rule set construction. When used in place of the cover set algorithm, this method could assist the physician.
In its present form, our method uses only the predicate defined in Section 4.2. When we use this predicate in the same rule with different time periods, we can represent the movement of blood test values. However, this induction is somewhat difficult for our system, because each literal is treated separately in each refinement step. This means that the possibility of obtaining rules that include the movement of blood tests is very small. This is a current limitation in representing the movement of blood tests. Rodríguez et al. [12] also propose other types of temporal literals. As we mentioned previously, the hypothesis space constructed with temporal literals requires a high computational cost for searching, and only a limited hypothesis space is explored. In this paper, we proposed inducing the literals efficiently by using a graph representation of the hypothesis space. We believe that we can extend this approach to other types of temporal literals.
7 Conclusion
In this paper, we propose a new data mining algorithm. The performance of
the algorithm was tested experimentally by use of real-world medical data. The
experimental results show that this algorithm can induce knowledge about tem-
poral relationships from medical data. The temporal knowledge is hard to obtain
by existing methods, such as a decision tree. Furthermore, physicians have shown
interest in the rules induced by our algorithm.
Although our results are encouraging, several areas of research remain to be explored. As shown in Section 5.2, our system induces hypotheses regardless of causality. We must bias the induced date periods to suit the knowledge of our consulting physicians. In addition, our algorithm must be subjected to experiments with different settings. We plan to apply this algorithm to other domains of medical data and also to non-medical temporal data. Extensions for treating numerical values must also be investigated; our current method requires attributes with discrete values. We plan to investigate these points in our future work.
Acknowledgments
We are grateful to Hideto Yokoi for fruitful discussions.
References
1. Adriaans, P., & Zantinge, D. (1996). Data Mining. London: Addison Wesley.
2. Baxter, R., Williams, G., & He, H. (2001). Feature Selection for Temporal Health
Records. Lecture Notes in Computer Science. 2035, 198–209.
3. Das, D., Lin, K., Mannila, H., Renganathan, G. & Smyth, P. (1998). Rule Discov-
ery from Time Series. In Proceedings of the Fourth International Conference on
Knowledge Discovery and Data Mining (pp. 16–22).
4. Džeroski, S. & Lavrač, N. (2001). Relational Data Mining. Berlin: Springer.
5. Gamberger, D., Lavrač, N., & Krstačić, G. (2003). Active subgroup mining: a
case study in coronary heart disease risk group detection, Artificial Intelligence in
Medicine, 28, 27–57.
6. Ichise, R., & Numao, M. (2001). Learning first-order rules to handle medical data.
NII Journal, 2, 9–14.
7. Keogh, E., & Pazzani, M. (2000). Scaling up Dynamic Time Warping for Datamin-
ing Applications, In the Proceedings of the Sixth International Conference on
Knowledge Discovery and Data Mining (pp. 285–289)
8. Kononenko, I. (2001). Machine learning for medical diagnosis: history, state of the
art and perspective, Artificial Intelligence in Medicine, 23, 89–109.
9. Motoda, H. editor. (2002) Active mining: new directions of data mining. In: Fron-
tiers in artificial intelligence and applications, 79. IOS Press.
10. Muggleton, S., & Firth, J. (2001). Relational rule induction with CProgol4.4: a
tutorial introduction, Relational Data Mining (pp. 160–188).
11. Quinlan, J. R. (1990). Learning logical definitions from relations. Machine Learning,
5, 3, 239–266.
12. Rodríguez, J. J., Alonso, C. J., & Boström, H. (2000). Learning First Order Logic
Time Series Classifiers: Rules and Boosting. Proceedings of the Fourth European
Conference on Principles and Practice of Knowledge Discovery in Databases (pp.
299–308).
13. Spenke, M. (2001). Visualization and interactive analysis of blood parameters with
InfoZoom. Artificial Intelligence in Medicine, 22, 159–172.
14. Tsumoto, S. (1999). Rule Discovery in Large Time-Series Medical Databases. Pro-
ceedings of Principles of Data Mining and Knowledge Discovery: Third European
Conference (pp. 23–31).
15. Yamada, Y., Suzuki, E., Yokoi, H., & Takabayashi, K. (2003) Classification by
Time-series Decision Tree, Proceedings of the 17th Annual Conference of the
Japanese Society for Artificial Intelligence, in Japanese, 1F5-06.
Extracting Diagnostic Knowledge
from Hepatitis Dataset
by Decision Tree Graph-Based Induction
1 Introduction
Viral hepatitis is a very critical illness. If it is left without suitable medical treatment, a patient may suffer from cirrhosis and fatal liver cancer. The condition progresses slowly and subjective symptoms are not noticed easily. Hence, in many cases, the disease has already become very severe by the time subjective symptoms are noticed. Although periodic inspection and proper treatment are important in order to prevent this situation, they involve high costs and a physical burden on the patient. There are alternative, much cheaper methods of inspection, such as blood tests and urinalysis. However, the amount of data becomes enormous, since the condition progresses slowly.
[Fig. 1: an example graph and its rewriting by chunking pairs of nodes (e.g. the pair 1→3)]
GBI(G)
  Enumerate all the pairs Pall in G
  Select a subset P of pairs from Pall (all the pairs in G) based on typicality criterion
  Select a pair from Pall based on chunking criterion
  Chunk the selected pair into one node c
  Gc := contracted graph of G
  while termination condition not reached
    P := P ∪ GBI(Gc)
  return P
Fig. 2. Algorithm of GBI
In Figure 1 the pattern 1→3 must be typical for the pattern 2→10 to be typical.
Said differently, unless pattern 1→3 is chunked, there is no way of finding the
pattern 2→10. The frequency measure satisfies this monotonicity. However, if the
criterion chosen does not satisfy this monotonicity, repeated chunking may not
find good patterns even though the best pair based on the criterion is selected
at each iteration. To resolve this issue, GBI was improved to use two criteria, one being a frequency measure for chunking and the other a measure for finding discriminative patterns after chunking. The latter criterion does not necessarily hold the monotonicity property. Any discriminative function can be used, such as Information Gain [6], Gain Ratio [7], or Gini Index [1].
The improved stepwise pair expansion algorithm is summarized in Figure 2. It
repeats the following four steps until the chunking threshold is reached (normally
minimum support value is used as the stopping criterion).
Step 1. Extract all the pairs consisting of two connected nodes in the graph.
Step 2a. Select all the typical pairs based on the typicality criterion from among
the pairs extracted in Step 1, rank them according to the criterion and register
them as typical patterns. If either or both nodes of the selected pairs have
already been rewritten (chunked), they are restored to the original patterns
before registration.
Step 2b. Select the most frequent pair from among the pairs extracted in Step 1
and register it as the pattern to chunk. If either or both nodes of the selected
pair have already been rewritten (chunked), they are restored to the original
patterns before registration. Stop when there is no more pattern to chunk.
Step 3. Replace the pair selected in Step 2b with one node and assign a new label to it. Rewrite the graph by replacing all the occurrences of the selected pair with a node with the newly assigned label. Go back to Step 1.
The output of the improved GBI is a set of ranked typical patterns extracted
at Step 2a. These patterns are typical in the sense that they are more discrimi-
native than non-selected patterns in terms of the criterion used.
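One iteration of the stepwise pair expansion (Steps 1-3) can be sketched on a simple directed, node-labelled graph. The frequency criterion is used for chunking; the data structures and names are our own simplification (the actual GBI also handles edge labels and keeps the typicality ranking of Step 2a).

from collections import Counter

def most_frequent_pair(edges, labels):
    """Step 1 + Step 2b: enumerate connected label pairs and pick the most frequent one."""
    counts = Counter((labels[u], labels[v]) for u, v in edges)
    return counts.most_common(1)[0] if counts else None

def chunk(edges, labels, pair, new_label):
    """Step 3: rewrite every non-overlapping occurrence of `pair` into one node."""
    labels = dict(labels)
    merged = {}                      # absorbed node -> surviving node
    for u, v in list(edges):
        if (labels.get(u), labels.get(v)) == pair and u not in merged and v not in merged:
            labels[u] = new_label    # u now represents the chunked pair
            merged[v] = u
    new_edges = {(merged.get(u, u), merged.get(v, v)) for u, v in edges}
    new_edges = {(u, v) for u, v in new_edges if u != v}   # drop self-loops from contraction
    for v in merged:
        labels.pop(v, None)
    return new_edges, labels

labels = {1: "a", 2: "a", 3: "b", 4: "a", 5: "a"}
edges = {(1, 2), (2, 3), (4, 5)}
pair, freq = most_frequent_pair(edges, labels)
print(pair, freq)                                  # ('a', 'a') occurs twice
print(chunk(edges, labels, pair, "aa"))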
A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ & a_{22} & \cdots & a_{2n} \\ & & \ddots & \vdots \\ & & & a_{nn} \end{pmatrix}

code(A) = a_{11} a_{12} a_{22} a_{13} a_{23} \cdots a_{nn}    (1)
        = \sum_{j=1}^{n} \sum_{i=1}^{j} (L+1)^{\left(\sum_{k=j+1}^{n} k\right) + j - i} a_{ij}    (2)
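Reading the upper-triangular elements column by column as digits of a base-(L+1) number gives the code directly; in the sketch below, L is assumed to be the largest value an element can take (e.g. the number of link label kinds), which is our reading of the reconstructed formula rather than the authors' exact definition.

def canonical_code(A: list, L: int) -> int:
    """code(A) of eqs. (1)-(2): concatenate the upper-triangular elements
    a11, a12, a22, a13, a23, ... column by column and interpret the resulting
    digit string as a base-(L+1) number."""
    n = len(A)
    digits = [A[i][j] for j in range(n) for i in range(j + 1)]   # column-wise, i <= j
    code = 0
    for d in digits:
        assert 0 <= d <= L, "elements must be valid base-(L+1) digits"
        code = code * (L + 1) + d
    return code

# 3-node example with a single link label (L = 1): digits a11 a12 a22 a13 a23 a33 = 0 1 0 1 1 0
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
print(canonical_code(A, L=1))   # 0b010110 = 22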
[Figure: example graph-structured data and the decision tree constructed from them; each internal node tests the existence of a subgraph pattern (Y/N branches) and the leaves assign classes A, B, and C]
DT-GBI(D)
  Create a node DT for D
  if termination condition reached
    return DT
  else
    P := GBI(D) (with the number of chunking specified)
    Select a pair p from P
    Divide D into Dy (with p) and Dn (without p)
    Chunk the pair p into one node c
    Dyc := contracted data of Dy
    for Di := Dyc, Dn
      DTi := DT-GBI(Di)
      Augment DT by attaching DTi as its child along yes(no) branch
  return DT
Fig. 5. Algorithm of DT-GBI
Fig. 6. Example of decision tree construction by DT-GBI
Each time an attribute (pair) is selected to divide the data, the pair is chunked into a larger node. Thus, although the initial pairs consist of two nodes and the edge between them, the attributes useful for the classification task are gradually grown into larger pairs (subgraphs) by applying chunking recursively. In this sense, the proposed DT-GBI method can be conceived of as a method for feature construction, since the features, namely the attributes (pairs) useful for the classification task, are constructed during the application of DT-GBI.
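The recursion of Fig. 5 can be sketched as follows; gbi, select_pair, contains_pair, contract, the termination test and the majority-label function are placeholders for the components described above, not the authors' implementation.

class Node:
    def __init__(self, pattern=None, yes=None, no=None, label=None):
        self.pattern, self.yes, self.no, self.label = pattern, yes, no, label

def dt_gbi(data, gbi, select_pair, contains_pair, contract, terminate, majority):
    """DT-GBI (Fig. 5): pick a pair (pattern) with GBI, split the graphs on its
    existence, chunk it in the 'yes' branch, and recurse on both partitions."""
    if terminate(data):                      # e.g. pure class or too few graphs
        return Node(label=majority(data))
    pairs = gbi(data)                        # run GBI with the specified number of chunkings
    pair = select_pair(pairs, data)          # e.g. maximize information gain
    d_yes = [g for g in data if contains_pair(g, pair)]
    d_no = [g for g in data if not contains_pair(g, pair)]
    if not d_yes or not d_no:
        return Node(label=majority(data))
    d_yes_c = [contract(g, pair) for g in d_yes]   # chunk the selected pair into one node
    return Node(pattern=pair,
                yes=dt_gbi(d_yes_c, gbi, select_pair, contains_pair, contract, terminate, majority),
                no=dt_gbi(d_no, gbi, select_pair, contains_pair, contract, terminate, majority))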
Note that the criterion for chunking and the criterion for selecting a classify-
ing pair can be different. In the following experiments, frequency is used as the
evaluation function for chunking, and information gain is used as the evaluation
function for selecting a classifying pair3 .
3
We did not use information gain ratio because DT-GBI constructs a binary tree.
Graph         aa ab ac ad ba bb bc bd cb cc da db dc
1 (class A)    1  1  0  1  0  0  0  1  0  0  0  0  1
2 (class B)    1  1  1  0  0  0  0  0  0  1  1  1  0
3 (class A)    1  0  0  1  1  1  0  0  0  0  0  1  0
4 (class C)    0  1  0  0  0  1  1  0  1  0  0  0  0

[Attribute-value table for graphs 1-3 after chunking the pair a→a into the node aa; the columns are the pairs of the rewritten graphs, e.g. aa→b, aa→c, aa→d, b→aa, ..., d→c]
In this example, the pair "a→a" is selected. The selected pair is then chunked in graph 1, graph 2, and graph 3, and these graphs are rewritten. On the other hand, graph 4 is left as it is.
The above process is applied recursively at each node to grow the decision tree while constructing the attributes (pairs) useful for the classification task at the same time. The pairs in graph 1, graph 2, and graph 3 are enumerated and the attribute-value tables are constructed as shown in Figure 8. After selecting the pair "(a→a)→d", the graphs are separated into two partitions, each of which contains graphs of a single class. The constructed decision tree is shown in the lower right-hand side of Figure 6.
4 Data Preprocessing
The dataset contains long time-series data (from 1982 to 2001) on laboratory
examination of 771 patients of hepatitis B and C. The data can be split into
Limiting Data Range. In our analysis it is assumed that each patient has one
class label, which is determined at some inspection date. The longer the interval between the date when the class label is determined and the date of a blood inspection, the less reliable the correlation between them. We consider that the pathological conditions remain the same for a certain duration and conduct the analysis on the data that lie within that range. Furthermore, although the original
dataset contains hundreds of examinations, feature selection was conducted with
the expert to reduce the number of attributes. The duration and attributes used
depend on the objective of the analysis and are described in the following results.
Class Label Setting. In the first and second experiments, we set the result
(progress of fibrosis) of the first biopsy as class. In the third experiment, we
set the subtype (B or C) as class. In the fourth experiment, the effectiveness of
interferon therapy was used as class label.
5 Preliminary Results
5.1 Initial Settings
To apply DT-GBI, we use two criteria for selecting pairs as described before.
One is frequency for selecting pairs to chunk, and the other is information gain
[6] for finding discriminative patterns after chunking.
A decision tree was constructed in either of the following two ways: 1) apply
chunking nr =20 times at the root node and only once at the other nodes of a
decision tree, 2) apply chunking ne =20 times at every node of a decision tree.
Decision tree pruning is conducted by postpruning: conduct pessimistic pruning
by setting the confidence level to 25%.
We evaluated the prediction accuracy of decision trees constructed by DT-
GBI by the average of 10 runs of 10-fold cross-validation. Thus, 100 decision trees
were constructed in total. In each experiment, to determine an optimal beam
width b, we first conducted 9-fold cross-validation for one randomly chosen 9
folds data (90% of all data) of the first run of the 10 runs of 10-fold cross-
validation, varying b from 1 to 15, and then adopted the narrowest beam width that gives the lowest error rate.
In the following subsections, both the average error rate and examples of
decision trees are shown in each experiment together with examples of extracted
patterns. Two decision trees were selected out of the 100 decision trees in each
experiment: one from the 10 trees constructed in the best run with the lowest
error rate of the 10 runs of 10-fold cross validation, and the other from the 10
trees in the worst run with the highest error rate. In addition, the contingency
tables of the selected decision trees for test data in cross validation are shown
as well as the overall contingency table, which is calculated by summing up the
contingency tables for the 100 decision trees in each experiment.
from 500 days before to 500 days after the first biopsy were extracted for the analysis of the biopsy result. When biopsy was performed several times on the same patient, the treatment (e.g., interferon therapy) after a biopsy may influence the results of the blood inspection and lower the reliability of the data. Thus, the date of the first biopsy and the result for each patient were retrieved from the biopsy data file. In case the result of the second or a later biopsy differs from the result of the first one, the result of the first biopsy is defined as the class of that patient for the entire 1,000-day time series.
Fibrosis stages are categorized into five stages: F0 (normal), F1, F2, F3,
and F4 (severe = cirrhosis). We constructed decision trees which distinguish
the patients at F4 stage from the patients at the other stages. In the following
two experiments, we used 32 attributes. They are: ALB, CHE, D-BIL, GOT,
GOT SD, GPT, GPT SD, HBC-AB, HBE-AB, HBE-AG, HBS-AB, HBS-AG,
HCT, HCV-AB, HCV-RNA, HGB, I-BIL, ICG-15, MCH, MCHC, MCV, PLT,
PT, RBC, T-BIL, T-CHO, TP, TTT, TTT SD, WBC, ZTT, and ZTT SD. Ta-
ble 1 shows the size of graphs after the data conversion.
As shown in Table 1, the number of instances (graphs) in the cirrhosis (F4) stage is 43, while the number of instances (graphs) in the non-cirrhosis stages (F0 + F1 + F2 + F3) is 219. An imbalance in the number of instances may cause a biased decision tree. In order to relax this problem, we limited the number of instances to a 2:3 (cirrhosis:non-cirrhosis) ratio, which is the same as in [12]. Thus, we used all instances from the F4 stage for the cirrhosis class (represented as LC) and selected 65 instances from the other stages for the non-cirrhosis class (represented as non-LC), 108 instances in all. How we selected these 108 instances is described below.
Experiment 1: F4 Stage vs {F0+F1} Stages
All 4 instances in F0 and 61 instances in F1 stage were used for non-cirrhosis
class in this experiment. The beam width b was set to 15. The overall result is
summarized in the left half of Table 2. The average error rate was 15.00% for
nr =20 and 12.50% for ne =20. Figures 11 and 12 show one of the decision trees
each from the best run with the lowest error rate (run 7) and from the worst run
with the highest error rate (run 8) for ne =20, respectively. Comparing these two
decision trees, we notice that three patterns that appeared at the upper levels of
each tree are identical. The contingency tables for these decision trees and the
overall one are shown in Table 3. It is important not to diagnose non-LC patients as LC patients in order to prevent unnecessary treatment, but it is even more important to classify LC patients correctly because the F4 (cirrhosis) stage might lead to hepatoma. Table 3 reveals that although the numbers of misclassified instances for LC (F4)
Table 3. Contingency table with the number of instances (F4 vs. {F0+F1})

                      Predicted Class
Actual     decision tree in Figure 11    decision tree in Figure 12    Overall
Class        LC   non-LC                   LC   non-LC                  LC   non-LC
LC            3      1                      4      1                   364      66
non-LC        1      5                      4      3                    69     581

LC = F4, non-LC = {F0+F1}
and non-LC ({F0+F1}) are almost the same, the error rate for LC is larger than that for non-LC because the class distribution of LC and non-LC is unbalanced (note that the number of instances is 43 for LC and 65 for non-LC). The results are not favorable in this regard. Predicting the minority class is more difficult than predicting the majority class, and this tendency holds for the remaining experiments. Regarding LC (F4) as positive and non-LC ({F0+F1}) as negative, the decision trees constructed by DT-GBI tended to have more false negatives than false positives.
Fig. 11. One of the ten trees from the best run in exp.1 (ne =20)
Fig. 12. One of the ten trees from the worst run in exp.1 (ne =20)
Fig. 13. Pattern 111 = Pattern 121 (if exist then LC): a pattern over blood test attributes (TTT_SD, GPT, I-BIL, HCT, D-BIL, MCHC, ALB, T-CHO) with an 8-months-later edge; Info. gain = 0.2595, LC (total) = 18, non-LC (total) = 0
[Pattern figure: D-BIL, ALB, and I-BIL with a 2-months-later edge; Info. gain = 0.0004, LC (total) = 16, non-LC (total) = 40]
Fig. 15. One of the ten trees from the best run in exp.2 (ne =20)
Fig. 16. One of the ten trees from the worst run in exp.2 (ne =20)
Table 4. Contingency table with the number of instances (F4 vs. {F3+F2})

                      Predicted Class
Actual     decision tree in Figure 15    decision tree in Figure 16    Overall
Class        LC   non-LC                   LC   non-LC                  LC   non-LC
LC            3      1                      2      2                   282     148
non-LC        2      5                      3      4                   106     544

LC = F4, non-LC = {F3+F2}
Fig. 17. Pattern 211 = Pattern 221 (if exist then LC)
date ALB D-BIL GPT HCT I-BIL MCHC T-CHO TTT_SD ・・・
19930517 L N H N N N N 1 ・・・
19930716 L L H N N N N ・・・
19930914 L L H N N N N 1 ・・・
19931113 L N H N N N N ・・・
19940112 L L H N N N N 1 ・・・
19940313 L N N N N N N 1 ・・・
19940512 L N H N N N N 1 ・・・
19940711 L N H N N N N 1 ・・・
19940909 L L H N N N N 1 ・・・
19941108 L N N N N N N 1 ・・・
19950107 L N N L N N N 1 ・・・
19950308 L N N N N N N 1 ・・・
19950507 L N H N N N N 1 ・・・
19950706 L N N L N N N 1 ・・・
19950904 L L N L N L N 1 ・・・
19951103 L L N N N N N 1 ・・・
The certainty of these patterns is ensured since, for almost all patients, they appear after the biopsy. These patterns include inspection items and values that are typical of cirrhosis. A pattern may appear only once or several times in one patient. Figure 19 shows the data of a patient for whom pattern 111 exists. As we made no attempt to estimate missing values, the pattern was not counted if the value of even one attribute was missing. In the data in Figure 19, pattern 111 would have been counted four times if the value of TTT_SD in the fourth line had been "1" instead of missing.
Table 5. Size of graphs (hepatitis type)
Table 6. Average error rates (%) (hepatitis type)
Fig. 20. One of the ten trees from the best run in exp.3 (ne =20)
Table 7. Contingency table with the number of instances (hepatitis type)

                      Predicted Class
Actual     decision tree in Figure 20    decision tree in Figure 21    Overall
Class      Type B   Type C                Type B   Type C              Type B   Type C
Type B        6        2                     7        1                  559      211
Type C        0       11                     4        7                  181      979
The overall result is summarized in Table 6. The average error rate was
23.21% for nr =20 and 20.31% for ne =20. Figure 20 and Figure 21 show samples
of decision trees from the best run with the lowest error rate (run 1) and the
worst run with the highest error rate (run 6) for ne = 20, respectively. Comparing these two decision trees, two patterns (shown in Figures 22 and 23) were identical and were used at the upper-level nodes. These patterns also appeared in
Fig. 21. One of the ten trees from the worst run in exp.3 (ne =20)
Fig. 22. Pattern 311 = Pattern 321 (if exist then Type B)
[Fig. 23: a pattern involving TTT_SD = 1]
almost all the decision trees and thus are considered sufficiently discriminative.
The contingency tables for these decision trees and the overall one are shown
in Table 7. Since hepatitis C tends to become chronic and can eventually lead to hepatoma, it is more valuable to classify the patients of type C correctly. The results are favorable in this regard because the minority class is type B in this experiment. Thus, regarding type B as negative and type C as positive, the decision trees constructed by DT-GBI tended to have more false positives than false negatives when predicting the hepatitis type.
Fig. 24. An example of graph-structured data for the analysis of interferon therapy (two-week subgraphs of blood test attributes such as ALB, GPT, GPT_SD, D-BIL, GOT, GOT_SD, connected by time-interval edges such as 2 weeks later, 10 weeks later, and 12 weeks later)
Table 8. Class labels for the effectiveness of interferon therapy

label
R    virus disappeared (Response)
N    virus existed (Non-response)
?    no clue for virus activity
R?   R (not fully confirmed)
N?   N (not fully confirmed)
??   missing
Table 9. Size of graphs (interferon therapy)

effectiveness of interferon therapy    R     N     Total
No. of graphs                          38    56    94
Avg. No. of nodes                      77    74    75
Max. No. of nodes                     123   121   123
Min. No. of nodes                      41    33    33

Table 10. Average error rate (%) (interferon therapy)

run of 10 CV          ne=20
1                     18.75
2                     23.96
3                     20.83
4                     20.83
5                     21.88
6                     22.92
7                     26.04
8                     23.96
9                     23.96
10                    22.92
Average               22.60
Standard Deviation     1.90
Fig. 25. One of the ten trees from the best run in exp.4
Fig. 26. One of the ten trees from the worst run in exp.4
Table 11. Contingency table with the number of instances (interferon therapy)

                      Predicted Class
Actual     decision tree in Figure 25    decision tree in Figure 26    Overall
Class         R      N                     R      N                     R      N
R             3      1                     2      2                    250    130
N             0      6                     4      2                     83    477
Fig. 27. Pattern 411 = Pattern 421 (if exist then R)
Fig. 28. Pattern 412 (if exist then R, if not exist then N)
Fig. 29. Pattern 422 (if exist then R, if not exist then N)
Fig. 30. Example of decision tree with time interval edges in exp. 4 (the tree tests Pattern 431, Pattern 432, Pattern 433, and Pattern 434 in sequence, with Response and Non-response leaves)
[Fig. 31: a pattern over blood test attributes (T-BIL, ALB, I-BIL, HCT, HGB, RBC, TP, T-CHO, MCHC, MCH, TTT_SD, ZTT_SD, MCV) containing a 4-weeks-later time interval edge]
Fig. 32. Pattern 432 (if exist then R, if not exist then N)
nations. Response to interferon therapy was judged by a medical doctor for each
patient, which was used as the class label for interferon therapy. The class labels
specified by the doctor for interferon therapy are summarized in Table 8. Note
that the following experiments (Experiment 4) were conducted for the patients
with label R (38 patients) and N (56 patients). Medical records for other patients
were not used.
To analyze the effectiveness of interferon therapy, we hypothesized that the
amount of virus in a patient was almost stable for a certain duration just before
the interferon injection in the dataset. Data in the range of 90 days to 1 day
before the administration of interferon were extracted for each patient, and averages were taken over two-week intervals.
pathological condition in the extracted data could directly affect the pathologi-
cal condition just before the administration. To represent this dependency, each
subgraph was directly linked to the last subgraph in each patient. An example
of converted graph-structured data is shown in Figure 24.
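The conversion described in this paragraph can be sketched as follows; the record layout, the discretization thresholds and all names are our assumptions, not the authors' preprocessing code.

from collections import defaultdict
from statistics import mean

def to_subgraphs(records, thresholds, window=(-90, -1), interval=14):
    """Average each blood-test attribute over two-week intervals taken from `window`
    (days relative to the start of interferon therapy) and discretize the averages.
    `records`: iterable of (attribute, day_offset, numeric_value);
    `thresholds[attr]`: list of (upper_bound, label) pairs in increasing order."""
    buckets = defaultdict(lambda: defaultdict(list))
    for attr, day, value in records:
        if window[0] <= day <= window[1]:
            buckets[(day - window[0]) // interval][attr].append(value)

    def discretize(attr, avg):
        for upper, label in thresholds[attr]:
            if avg <= upper:
                return label
        return thresholds[attr][-1][1]

    # One dict per two-week interval, ordered in time; in the graph representation each of
    # these subgraphs is linked to the last one (just before the administration).
    return [{attr: discretize(attr, mean(vals)) for attr, vals in buckets[k].items()}
            for k in sorted(buckets)]

# Hypothetical GPT thresholds: <=40 normal (N), <=100 high (H), otherwise very high (VH).
thr = {"GPT": [(40, "N"), (100, "H"), (float("inf"), "VH")]}
recs = [("GPT", -80, 120.0), ("GPT", -75, 90.0), ("GPT", -10, 35.0)]
print(to_subgraphs(recs, thr))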
As in subsection 5.2 and subsection 5.3, feature selection was conducted to re-
duce the number of attributes. Since the objective of this analysis is to predict the
effectiveness of interferon therapy without referring to the amount of virus, the at-
tributes of antigen and antibody (HBC-AB, HBE-AB, HBE-AG, HBS-AB, HBS-
AG, HCV-AB, HCV-RNA) were not included. Thus, as in subsection 5.3 we used
the following 25 attributes: ALB, CHE, D-BIL, GOT, GOT SD, GPT, GPT SD,
HCT, HGB, I-BIL, ICG-15, MCH, MCHC, MCV, PLT, PT, RBC, T-BIL, T-CHO,
TP, TTT, TTT SD, WBC, ZTT, and ZTT SD. Table 9 shows the size of graphs
after the data conversion. The beam width b was set to 3 in experiment 4.
The results are summarized in Table 10 and the overall average error rate
was 22.60% (in this experiment we did not run the cases for nr =20). Figures 25
and 26 show examples of decision trees each from the best run with the lowest
error rate (run 1) and the worst run with the highest error rate (run 7) respec-
tively. Patterns at the upper nodes in these trees are shown in Figures 27, 28 and
29. Although the structure of decision tree in Figure 25 is simple, its prediction
accuracy was actually good (error rate=10%). Furthermore, since the pattern
shown in Figure 27 was used at the root node of many decision trees, it is con-
sidered as sufficiently discriminative for classifying patients for whom interferon
therapy was effective (with class label R).
The contingency tables for these decision trees and the overall one are shown
in Table 11. Regarding the class label R (Response) as positive and the class
label N (Non-response) as negative, the decision trees constructed by DT-GBI
tended to produce more false negatives when predicting the effectiveness of interferon
therapy. As in experiments 1 and 2, the minority class is more difficult to predict.
The patients with class label N are mostly classified correctly as “N”, which will
contribute to reducing fruitless interferon therapy, but some of
the patients with class label R are also classified as “N”, which may lead to missing
the opportunity of curing patients with interferon therapy.
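To make this reading of the contingency table concrete, a small sketch follows, with R treated as positive and N as negative; the counts are purely illustrative, since Table 11 is not reproduced here.

def contingency_summary(tp, fn, fp, tn):
    # tp: R predicted as R, fn: R predicted as N (missed responders),
    # fp: N predicted as R, tn: N predicted as N (avoided fruitless therapy)
    sensitivity = tp / (tp + fn)   # fraction of responders that are caught
    specificity = tn / (tn + fp)   # fraction of non-responders correctly rejected
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    return sensitivity, specificity, accuracy

# Illustrative counts only, not the values of Table 11
print(contingency_summary(tp=25, fn=13, fp=8, tn=48))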
Unfortunately, only a few patterns in the constructed decision trees contain
time-interval edges, so we were unable to investigate how the change or stability
of blood test results affects the effectiveness of interferon therapy. Figure 30 is
an example of a decision tree with time-interval edges for the analysis of interferon
therapy, and some patterns in this tree are shown in Figures 31 and 32.
6 Conclusion
This paper has proposed a method called DT-GBI, which constructs a classi-
fier (decision tree) for graph-structured data by GBI. Substructures useful for
Acknowledgment
This work was partially supported by the grant-in-aid for scientific research 1)
on priority area “Realization of Active Mining in the Era of Information Flood”
(No. 13131101, No. 13131206) and 2) No. 14780280 funded by the Japanese
Ministry of Education, Culture, Sports, Science and Technology.
1 Introduction
[11][3][4][5]. We have developed a data mining oriented CRM system that runs
on MUSASHI by integrating several marketing tools and data mining technol-
ogy. Discussing the cases regarding simple customer management in Japanese
supermarkets and drugstores, we shall describe general outlines, components,
and analytical tools for CRM system which we have developed.
With the recent deflation of the Japanese economy, retailers in Japan
now face a highly competitive environment and severe pressure. Many of these enter-
prises are trying to attract and retain their loyal customers through
the introduction of FSP (Frequent Shoppers Program) [6][14]. FSP is defined
as a CRM system that accomplishes effective sales promotion by accu-
mulating the purchase history of customers with membership cards in its own
database and by recognizing the nature and the behavior of the loyal customers.
However, it is very rare that a CRM system such as FSP has actually contributed
to the successful business activities of these enterprises in recent years.
There are several reasons why existing CRM systems cannot contribute to
the acquisition of customers and to the attainment of competitive advantage in
business. First of all, the cost of constructing a CRM system is very high. In fact,
some enterprises have spent a large amount of money merely on
the construction of a data warehouse to accumulate the purchase history data of their
customers and, as a result, no budget is left for carrying out customer analysis.
Secondly, it happens very often that although data are actually accumulated,
the techniques, software, and human resources needed to analyze these data are
in short supply, and thus the analysis of customers does not progress. Therefore,
in many cases, enterprises simply accumulate the data but do not carry
out the analysis of their customers.
In this paper, we shall introduce a CRM system, named C-MUSASHI, which
can be constructed at very low cost by the use of the open-source software
MUSASHI, and thus can be adopted freely even by a small enterprise. C-
MUSASHI consists of three components: basic tools for customer analysis, store
management systems, and data mining oriented CRM systems. These compo-
nents have been developed through joint research activities with various types
of enterprises. With C-MUSASHI, it is possible to carry out the analysis of the
customers without investing a large amount of budget for building up a new
analytical system. In this paper we will explain the components of C-MUSASHI
and cases where C-MUSASHI is applied to a large amount of customer history
data of supermarkets and drugstores in Japan to discover useful knowledge for
marketing strategy.
2 C-MUSASHI in Retailers
MUSASHI, the Mining Utilities and System Architecture for Scalable processing
of Historical data, is a data mining platform [3][5] that efficiently and flexibly
processes large-scale data described in XML. One of its re-
markable advantages lies in its powerful and flexible ability to preprocess large
amounts of raw data in the knowledge discovery process. MUSASHI has been developed
as open-source software, and thus anybody can download it freely from [11].
MUSASHI has a set of small data processing commands designed for retriev-
ing and processing large datasets efficiently for various purposes such as data
extraction, cleaning, reporting and data mining. Such data processing can be
executed simply by running MUSASHI commands as a shell script. These com-
mands also include various data mining commands, such as sequential mining,
association rule, decision tree, graph mining, and clustering commands.
MUSASHI uses XML as a data structure to integrate multiple databases, by
which various types of data can be represented. MUSASHI makes it feasible to
carry out the flexible and low-cost integration of the structured and vast business
data in companies.
XML data with minimal data loss. Figure 2 shows a sample of XML data that
is converted from electronic journal data output. It is clear that detailed operations
on POS cash registers are accumulated in an XML data structure.
However, if all operation logs at POS registers are accumulated as XML data,
the amount of data may become enormous, which in turn decreases
the processing speed. In this respect, we define a table-type data structure called
the XML table (see Figure 3). It is easy for users to transform XML data into XML
table data by using the "xml2xt" command in MUSASHI. The XML table is an
XML document whose root element, <xmltbl>, has two elements,
<header> and <body>. The table data is described in the body element using
a very simple text format with one record on each line, and each field within
a record separated by a comma. Names and positional information relating to
each of the fields are described in the <field> element. Therefore it is possible to
access data via field names. The data title and comments are given in their
respective <title> and <comment> elements. A system is built up by combining
pure XML data, such as operation logs, with XML table data. Thus,
by properly using pure XML and XML tables depending on the purpose, it is
possible to construct an efficient system with a high degree of freedom.
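To make the XML table format concrete, the following sketch parses a small hypothetical XML table; the exact tag layout (e.g., how field names are stored inside <header>) is assumed from the description above and may differ from MUSASHI's actual format.

import xml.etree.ElementTree as ET

# A hypothetical XML table: <xmltbl> holds <header> (with <title>, <comment>,
# <field>) and <body>, whose text carries one comma-separated record per line.
sample = """<xmltbl>
  <header>
    <title>Sales data</title>
    <comment>converted from the POS journal</comment>
    <field>date,customer_id,item,amount</field>
  </header>
  <body>
20020901,C001,soap,298
20020901,C002,milk,158
</body>
</xmltbl>"""

root = ET.fromstring(sample)
fields = root.find("header/field").text.split(",")
records = [dict(zip(fields, line.split(",")))
           for line in root.find("body").text.strip().splitlines()]
print(records[0]["item"])   # -> soap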
Fig. 3. Sales data which has been converted into an XML table
In this section, we shall introduce the tools for basic customer analysis in C-
MUSASHI. Such tools are usually incorporated in existing CRM systems. Here
we will explain some of such tools: decile analysis, RFM analysis, customer
attrition analysis, and LTV (life time value) measurement. C-MUSASHI also
has many other tools for customer analysis; we present only a part of them here.
In decile analysis, based on the ranking of customers derived from their purchase
amounts, customers are divided into ten groups of equal size, and then basic
indices such as the average purchase amount, the number of visits to the store, etc.
are computed for each group [6][14]. From this report, it can be understood that
not all customers have equal value for the store; only a small fraction
of the customers contribute most of the profit of the store.
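A minimal sketch of decile analysis as described above, assuming a simple customer-to-amount mapping as input (C-MUSASHI implements this with its own commands):

def decile_analysis(purchases):
    # purchases: dict customer_id -> total purchase amount.
    # Rank customers by purchase amount, split them into 10 equal-sized groups
    # (decile 1 = top spenders), and report the average purchase per group.
    ranked = sorted(purchases, key=purchases.get, reverse=True)
    size = max(1, len(ranked) // 10)
    report = {}
    for d in range(10):
        group = ranked[d * size:(d + 1) * size] if d < 9 else ranked[9 * size:]
        if group:
            report[d + 1] = sum(purchases[c] for c in group) / len(group)
    return report

# Example with hypothetical amounts
amounts = {f"C{i:03d}": 1000 - i * 7 for i in range(100)}
print(decile_analysis(amounts))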
RFM analysis [14][8] is one of the tools most frequently used in applications
such as direct-mail marketing. The customers are classified according to
three factors, i.e. recency of the last date of purchase, frequency of purchase, and
a monetary factor (purchase amount). Based on this classification, adequate sales
promotion is executed for each customer group. For instance, in a supermarket, if
a customer had the highest purchase frequency and the highest purchase amount,
but did not visit the store within one month, sufficient efforts must be made
to bring this customer back from the stores of the competitors.
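A sketch of RFM scoring along the lines described above; the thresholds and the 1-5 scoring bands are illustrative assumptions, not C-MUSASHI's actual parameters.

from datetime import date

def rfm_score(last_purchase, n_visits, total_amount, today=date(2002, 12, 31)):
    # Return simple 1-5 scores for the recency, frequency, and monetary factors.
    recency_days = (today - last_purchase).days
    r = (5 if recency_days <= 7 else 4 if recency_days <= 30 else
         3 if recency_days <= 90 else 2 if recency_days <= 180 else 1)
    f = min(5, 1 + n_visits // 10)              # illustrative frequency bands
    m = min(5, 1 + int(total_amount) // 50000)  # illustrative monetary bands
    return r, f, m

# A customer with top frequency and amount who has not visited for over a month
print(rfm_score(date(2002, 11, 20), n_visits=60, total_amount=300000))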
LTV is the net present value of the profit which an average customer in a certain
customer group brings to a store (an enterprise) within a given period [8][1].
It is calculated from data such as the sales amount of the customer group, the
customer retention rate, and a discount rate such as the interest rate on a
national bond. Long-term customer strategy should be set up based on LTV,
and it is an important factor relating to CRM systems. However, the component
for the calculation of LTV prepared in C-MUSASHI is currently very simple and it
must be customized by the enterprises that use it.
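A common simplified LTV formulation consistent with this description (period profit, retention rate, and discount rate); this is only a textbook-style sketch, and the actual C-MUSASHI component may compute LTV differently.

def lifetime_value(profit_per_period, retention_rate, discount_rate, periods):
    # Net present value of the profit an average customer of a group brings
    # over the given number of periods.
    return sum(profit_per_period * retention_rate ** t / (1 + discount_rate) ** t
               for t in range(periods))

# E.g. 5,000 yen profit per quarter, 80% retention, 1% discount rate, 8 quarters
print(round(lifetime_value(5000, 0.80, 0.01, 8)))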
These four tools are the minimum requirement, and very important, for CRM
in the business field. It is possible to set up various types of marketing strategies
based on the results of such analyses. However, they are general and conventional, and
thus do not necessarily bring new knowledge to support the differentiation strategy
of the enterprise.
consume large amounts of cash because of the high growth of their market. "Cash cows," with
high market share and low market growth, yield excess cash. "Question
marks," which have low market share and high market growth, consume large
amounts of cash in order to gain market share and become stars. "Dogs," with
low market share and slow growth, neither generate nor consume a large amount
of cash. In PPM, the strategy of each SBU is determined by whether its market
share is high or low and whether its market growth rate is high or not.
Based on the idea of the above PPM, we propose Store Portfolio Management
(SPM), which is a methodology to support strategic planning for chain store
management. SPM provides useful information on each store, such as the store's
profitability and the effectiveness of its sales promotions, from the viewpoint of the
headquarters of the overall chain. It is impossible for the headquarters to completely understand
the situation of every store and, even if it were possible, it would be impossible to imple-
ment a different marketing strategy for each individual store. The purpose of
SPM is to plan a different marketing strategy for each cluster of stores facing
similar situations and markets. Here the situations and markets of
stores are measured using various dimensions.
Various kinds of evaluation dimensions can be employed in SPM, such as
profitability, sales amount, the number of visiting customers, and the average
sales amount per customer (see Table 1). Users can select appropriate dimensions
corresponding to their situation and industry. In SPM, each evaluation criterion
is associated with a marketing strategy. For example, if a store is
It is difficult for the marketing staff of a chain store to categorize their stores
into a number of clusters by using all the above dimensions because of the low
interpretability of the resulting clusters. According to our experience, it seems
appropriate to use two or three dimensions to cluster stores into
meaningful groups.
drugstore chain, we found that the reason for the low profitability of these drug-
stores is the ineffectiveness of sales promotion in each store. We then determined
which product categories have to be improved to make sales promotion effective.
One of the modules calculates, for each product category, the number of days
on which the sales amount at a discounted price is less than the
average sales amount at the regular price. Such a promotion is called a "waste sales
promotion". Figure 7 shows the results obtained for three stores S1, S2, S3. It is
observed from the figure that the number of waste sales promotions in store A's
soap category (see Figure 7) is larger than that of the other shops, and the number of
waste sales promotions in the mouth care category of store B is large. The marketing
staff in the headquarters of the drugstore chain can plan a different sales strategy for each
store, focusing on specific categories, based on these data.
Fig. 7. The number of products with waste sales promotions of each store
ment systems. Currently these store management systems can deal with the data
of less than 100 stores and can be applied to supermarket and drugstore chains.
In this section a data mining oriented CRM system is presented, which dis-
covers new knowledge useful for implementing an effective CRM strategy from the
purchase data of customers. Commercially available CRM systems generally
comprise only retrieval and aggregation functions, like the basic tools of C-
MUSASHI, and there are very few CRM systems in which an analytical system
equipped with a data mining engine that can deal with large-scale data is avail-
able. In this section, we explain our system, which can discover useful customer
knowledge by integrating data mining techniques with a CRM system.
The decile switch analysis module is a system to find out what kinds of changes in the
purchase behavior of each of the ten customer groups computed by decile analysis
strongly influence the sales of the store. Given two periods, the following
processing is started automatically: First, the customers who visited the
store in the first period are classified into ten groups by decile analysis. The
customers in each group are then classified into 10 groups again according to the
decile analysis of the second period. When a customer did not visit the store in
the second period, he/she does not belong to any decile group and is thus classified
into a separate group. Thus, the customers are classified into 11 groups. Since the
customers are classified into 10 groups in the first stage, they are classified into
110 groups in total.
For each of these customer groups, the difference between the purchase
amount in the preceding period and that in the subsequent period is calcu-
lated. We then determine which customer groups' changes in sales amount
strongly influence the total sales amount of the store.
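A sketch of the grouping step of decile switch analysis, assuming a simple customer-to-amount mapping per period; customers absent in the second period go into an 11th group, giving up to 110 (first decile, second decile) groups in total.

from collections import defaultdict

def decile_of(amounts):
    # amounts: dict customer -> purchase amount; returns customer -> decile 1..10.
    ranked = sorted(amounts, key=amounts.get, reverse=True)
    return {c: min(10, 1 + i * 10 // len(ranked)) for i, c in enumerate(ranked)}

def decile_switch(period1, period2):
    # period1/period2: dict customer -> purchase amount in each period.
    # Returns (d1, d2) -> total change of purchase amount, where d2 = 11 means
    # the customer did not visit the store in the second period.
    d1, d2 = decile_of(period1), decile_of(period2)
    change = defaultdict(float)
    for c, amount1 in period1.items():
        group = (d1[c], d2.get(c, 11))            # 11 = left the store
        change[group] += period2.get(c, 0.0) - amount1
    return change

Ranking the groups by the absolute value of this change shows which customer groups influence the total sales of the store most strongly.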
Next, judging from the above numerical values (influence on total sales
amount of the store), the user decides which of the following data he/she wants
to see, e.g., decile switch of (a) all customers, (b) loyal customers of the store,
or (c) quasi-loyal customers (See Figure 11). If the user wants to see the decile
switch of quasi-loyal customers, the sales ratio of each product category for the
relevant customer group in the preceding period is calculated, and a decision
tree is generated, which shows the difference in the purchased categories be-
tween the quasi-loyal customers whose decile increased in the subsequent period
(decile-up) and those whose decile value decreased (decile-down). Based on the
rules obtained from the decision tree, the user can judge which product category
should be recommended to quasi-loyal customers in order to increase the total
sales of the store in the future.
had been purchasing, the store manager can obtain useful information on the
improvement of merchandise lineup at the store to keep loyal customers.
The four modules explained above are briefly summarized in Table 2.
Table 2. The summary of four modules in data mining oriented CRM systems
In this section we discuss cases of C-MUSASHI in the real business world.
Since we cannot deal with all the cases of the four modules in this paper, we
introduce cases of the decile switch analysis and customer attrition analysis
modules in a large-scale supermarket.
categorized in each decile switch group. There are many customers staying in
decile 1 in both April and May, who are the most important loyal customers
of the store. However, they do not have such a strong effect on the sales of this store,
though their number is larger than that of the other decile switch groups.
Figure 13 shows the changes of the purchase amounts in April and May of the
customer groups classified according to the decile values of both periods. In the
figure, the portion indicated by a circle shows that the sales of the quasi-loyal
customer groups (the customers with decile 2-4) in April increase in May. From
the figure, it is easy to observe that the sales increase of the quasi-loyal customers
makes a great contribution to the increase of the total sales amount of the store.
Next, focusing on the quasi-loyal customers based on the above information,
we carried out decile switch analysis by using a decision tree to classify them into
decile-up or decile-down customer groups. In the rules obtained from the decision
tree, we found some interesting facts (see Figure 14). For instance, it was found
that the customers who had exhibited a higher purchase percentage of easily
perishable product categories such as milk, eggs, and yoghurt showed
a high purchase amount in the subsequent period. Also, it was discovered that the
customers who had been purchasing drugs such as medicine for colds or headaches
exhibited an increase in decile value in the subsequent period.
The store manager interpreted these rules as follows: If a customer is inclined
to purchase daily foodstuffs at a store, the total purchase amount of the customer
Fig. 14. The rules obtained from the decision tree by using decile switch analysis
module
Fig. 15. The definition of loyal customer by using frequency and monetary dimensions
A user can define the loyal customer segments by using the frequency and mon-
etary dimensions (the black cells in Figure 15 illustrate an example of the loyal
customer segments). After defining them, two customer groups are extracted:
the first group, called "fixed loyal customers", is defined as those who had been
loyal customers continuously for the past four months of the target period, and
the second, called "attrition customers", is defined as those who had been loyal
customers continuously for the first three months and had gone to other stores in
the final month. We used four months of purchase data from Sep. 2002 through
Dec. 2002 of a supermarket in Japan. The number of fixed loyal customers
is 1918, and that of attrition customers is 191. Using the sales ratios of coarse
product categories as explanatory variables, a classification model is generated
which characterizes the distinction between the attrition customers and the fixed
customers.
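A sketch of this classification step under stated assumptions: the category list is illustrative, the training data below are synthetic, and scikit-learn's decision tree stands in for the decision-tree learner actually used in C-MUSASHI.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

CATEGORIES = ["fruit", "meat", "fish", "milk", "drug"]   # illustrative coarse categories

def ratio_vector(category_sales):
    # category_sales: dict category -> sales amount of one customer over the
    # first three months; returns the sales-ratio feature row used as input.
    total = sum(category_sales.values()) or 1.0
    return [category_sales.get(c, 0.0) / total for c in CATEGORIES]

# Synthetic stand-in data: rows of sales ratios, 1 = attrition, 0 = fixed loyal
rng = np.random.default_rng(0)
X = rng.dirichlet(np.ones(len(CATEGORIES)), size=400)
y = ((X[:, 0] < 0.05) & (X[:, 1] < 0.15)).astype(int)   # toy rule echoing Fig. 16

model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)
print(export_text(model, feature_names=CATEGORIES))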
Figure 16 shows a part of the results extracted from the derived decision tree.
In this figure it was observed that the customers whose ratio of the fruit category
in their purchases did not exceed 5 percent and whose ratio of the meat cat-
egory did not exceed 15 percent became attrition customers after
one month at a ratio of about 66 percent (44/67). Executing additional anal-
ysis, the marketing staff of the store interpreted these rules as follows: The
customers who have children have a tendency to leave the store. Based on
these findings, a new sales promotion strategy for families with children has
been planned in the store.
Fig. 16. A part of results extracted from purchase data of target customer groups
7 Conclusion
References
1. Blattberg, R. C., Getz, G., Thomas, J. S.: Customer Equity: Building and Man-
aging Relationships as Valuable Assets. Harvard Business School Press. (2001)
2. Fujisawa, K., Hamuro, Y., Katoh, N., Tokuyama, T. and Yada, K.: Approximation
of Optimal Two-Dimensional Association Rules for Categorical Attributes Using
Semidefinite Programming. Lecture Notes in Artificial Intelligence 1721. Proceed-
ings of First International Conference DS’99. (1999) 148–159
3. Hamuro, Y., Katoh, N. and Yada, K.: Data Mining oriented System for Business
Applications. Lecture Notes in Artificial Intelligence 1532. Proceedings of First
International Conference DS’98. (1998) 441–442
4. Hamuro, Y., Katoh, N., Matsuda, Y. and Yada, K.: Mining Pharmacy Data Helps
to Make Profits. Data Mining and Knowledge Discovery. 2 (1998) 391–398
5. Hamuro, Y., Katoh, N. and Yada, K.: MUSASHI: Flexible and Efficient Data Pre-
processing Tool for KDD based on XML. DCAP2002 Workshop held in conjunction
with ICDM2002. (2002) 38–49
6. Hawkins, G. E.: Building the Customer Specific Retail Enterprise. Breezy Heights
Publishing. (1999)
7. Hedley, B.: Strategy and the business portfolio. Long Range Planning. 10 1 (1977)
9–15
8. Hughes, A. M.: Strategic Database Marketing. The McGraw-Hill. (1994)
9. Ip, E., Johnson, J., Yada, K., Hamuro, Y., Katoh, N. and Cheung, S.: A Neural
Network Application to Identify High-Value Customer for a Large Retail Store
in Japan. Neural Networks in Business: Techniques and Applications. Idea Group
Publishing. (2002) 55–69
10. Ip, E., Yada, K., Hamuro, Y. and Katoh, N.: A Data Mining System for Man-
aging Customer Relationship. Proceedings of the 2000 Americas Conference on
Information Systems. (2000) 101–105
11. MUSASHI: Mining Utilities and System Architecture for Scalable processing of
Historical data. URL: http://musashi.sourceforge.jp/.
12. Reed, T.: Measure Your Customer Lifecycle. DM News 21. 33 (1999) 23
13. Rongstad, N.: Find Out How to Stop Customers from Leaving. Target Marketing
22. 7 (1999) 28–29
14. Woolf, B. P.: Customer Specific Marketing. Teal Books. (1993)
Investigation of Rule Interestingness
in Medical Data Mining
1 Introduction
Medical data mining is one of the active research fields in Knowledge Discovery in
Databases (KDD) due to its scientific and social contribution. We have been
conducting case studies to discover new knowledge on the symptoms of hepatitis,
a progressive liver disease, since grasping the symptoms of hepatitis is
essential for its medical treatment. We have repeatedly obtained rules to pre-
dict prognosis from a clinical dataset and had them evaluated by a medical expert,
improving the mining system and conditions each time. In this iterative process, we rec-
ognized the significance of system-human interaction for enhancing rule quality by
reflecting the domain knowledge and the requirements of a medical expert. We
2 Related Work
2.1 Our Previous Research on Medical Data Mining
We have conducted case studies [1] to discover rules predicting prognosis
based on diagnosis in a dataset of medical test results on viral chronic hep-
atitis. The dataset was released as a common dataset for data mining contests [8, 9].
We repeated the cycle of rule generation by our mining system and rule evaluation
by a medical expert twice, and this repetition led us to discover rules
highly valued by the medical expert. We finely pre-processed the dataset based
on the medical expert's advice, since such a real medical dataset is ill-defined and has
much noise and many missing values. We then applied a popular time-series min-
ing technique, which extracts representative temporal patterns from a dataset
by clustering and learns rules consisting of the extracted patterns with a decision tree
[11]. We did not apply the subsequence extraction with a sliding window used in
the literature [11], to avoid the STS clustering problem pointed out in the literature
[12]. We simply extracted subsequences at the starting point of medical tests
from the dataset of medical test results. In the first mining, we used the EM algo-
rithm [2] for clustering and C4.5 [3] for learning. In the second mining, we used
the K-means algorithm [4] for clustering to address the problem that the EM algorithm
is not easily understandable or adjustable for medical experts.
The upper graph in Fig. 1 shows one of the rules, which the medical expert
focused on, in the first mining. It estimates the future trend of GPT, one of the major
medical tests for grasping hepatitis symptoms, over the following year by using the
Fig. 1. Highly valued rules in the first (upper) and the second mining (lower)
changes of several medical test results in the past two years. The medical expert
commented on it as follows: the rule offers a hypothesis that the GPT value changes
with a roughly three-year cycle, and the hypothesis is interesting since it differs
from the conventional view of medical experts that the GPT value basically
decreases monotonically. We then improved our mining system, extended the
observation term, and obtained new rules. The lower graph in Fig. 1 shows one
of the rules, which the medical expert highly valued, in our second mining. The
medical expert commented that it implies the GPT value globally changes two
times over the past five years and more strongly supports the hypothesis of GPT's
cyclic change.
In addition to that, we obtained the evaluation results of rules by the medical
expert. We removed obviously meaningless rules in which the medical test results
stay in their normal ranges and presented remaining rules to the medical expert.
He conducted the following evaluation tasks: for each mining, he checked all pre-
sented rules and gave each of them a comment on its medical interpretation
Fig. 2. Conceptual model of interaction between a mining system and a medical expert
and one of the rule quality labels. The rule quality labels were Especially-Interesting
(EI), Interesting (I), Not-Understandable (NU), and Not-Interesting (NI). EI
means that the rule was a key to generate or confirm a hypothesis. As a conse-
quence, we obtained a set of rules and their evaluation results by the medical
expert in the first mining and that in the second mining. Three and nine rules
received EI and I in the first mining, respectively. Similarly, two and six rules
did in the second mining.
After we achieved some positive results, we tried to systematize the know-
how and methodology obtained through this repeated mining and evaluation
process, focusing on the system-human interaction needed to polish up rules in the medical domain.
We made a conceptual model of interaction that describes the relation and media
between a mining system and a medical expert (see Fig. 2). It also describes the
abilities and roles of a mining system and those of a medical expert.
A mining system learns hidden rules faithfully to the data structure and
offers them to a medical expert as hints and materials for hypothesis gen-
eration and confirmation (note that the word 'confirmation' in this paper does
not mean the highly reliable proof of a hypothesis by additional medical exper-
iments under strictly controlled conditions; it means the extraction of additional
information from the same data to enhance the reliability of an initial hypothe-
sis). Meanwhile, a medical expert generates and confirms a hypothesis, namely a seed
of new knowledge, by evaluating the rules based on his/her domain knowledge.
A framework to support KDD through such human-system interaction should
reflect, in a well-balanced manner, both objective criteria (the mathematical features
of data) and subjective criteria (the domain knowledge, interest, and focus of
a medical expert) in each process of KDD, namely pre-processing, mining, and
post-processing. It is important not to mix up the objective and subjective cri-
teria, to avoid overly biased and unreliable results; the framework should explicitly
notify the mining system and the medical expert which criterion or which combina-
tion of criteria is currently used. Against this background, this research focuses
The interest which a human user really feels for a rule in his/her mind. It is
formed by the synthesis of cognition, domain knowledge, individual experiences,
and the influences of the rules that he/she evaluated before.
This research specifically focuses on objective measures and investigates the
relation between them and real human interest. We therefore explain the details
of objective measures here. They can be categorized into several groups according to
the criterion and theory used for evaluation. Although the criterion is absolute or
relative as shown in Table 1, the majority of present objective measures are
based on an absolute criterion. There are several kinds of criteria based on
the following factors: Correctness – how many instances support the antecedent and/or
consequent of a rule, or how strong their dependence is [13, 14, 20, 23];
Generality – how similar the trend of a rule is to that of all data [18] or of the
other rules; Uniqueness – how different the trend of a rule is from that of all
data [17, 21, 24] or of the other rules [18, 20]; and Information Richness – how much
information a rule possesses [15]. These factors naturally prescribe the theory
for evaluation and the interestingness calculation method based on that theory.
The theories include the number of instances [13], probability [19, 21], statistics
[20, 23], information [14, 23], the distance of rules or attributes [17, 18, 24], and
the complexity of a rule [15] (see Table 1).
We selected the basic objective measures shown in Table 2 for the experi-
ment in Section 3. Some objective measures in Table 2 are written in
abbreviated form: GOI means "Gray and Orlowska's Interestingness" [19], and
we call GOI with the dependency coefficient weighted at double the generality
one GOI-D, and vice versa for GOI-G. χ2-M means "χ2 Measure using all
four quadrants, A → C, A → ¬C, ¬A → C, and ¬A → ¬C" [20]. J-M and K-M
mean "J Measure" [14] and "our original measure based on J-M" (we used 'K'
for the name of this measure, since 'K' is the letter after 'J'). PSI means
"Piatetsky-Shapiro's Interestingness" [13]. Although we also used other ob-
jective measures on reliable exceptions [33, 34, 35], they gave all rules the same
lowest evaluation values due to the mismatch between their evaluation objects and ours.
This paper therefore does not report their experimental results.
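For concreteness, the sketch below computes a few of the measures named above from the contingency counts of a rule A → C using their standard textbook definitions; the exact variants in Table 2 (e.g., for GOI, K-M, or Credibility) may differ.

def basic_measures(n_ac, n_anc, n_nac, n_nanc):
    # Counts: n_ac = |A and C|, n_anc = |A and not C|,
    # n_nac = |not A and C|, n_nanc = |not A and not C| (all marginals non-zero).
    n = n_ac + n_anc + n_nac + n_nanc
    n_a, n_c = n_ac + n_anc, n_ac + n_nac
    precision = n_ac / n_a                   # confidence P(C|A)
    recall = n_ac / n_c                      # P(A|C)
    coverage, prevalence = n_a / n, n_c / n
    specificity = n_nanc / (n_anc + n_nanc)  # P(not A | not C)
    accuracy = (n_ac + n_nanc) / n
    lift = precision / prevalence
    psi = n_ac - n_a * n_c / n               # Piatetsky-Shapiro's interestingness
    chi2 = sum((obs - exp) ** 2 / exp        # chi-square over all four quadrants
               for obs, exp in [(n_ac, n_a * n_c / n),
                                (n_anc, n_a * (n - n_c) / n),
                                (n_nac, (n - n_a) * n_c / n),
                                (n_nanc, (n - n_a) * (n - n_c) / n)])
    return dict(precision=precision, recall=recall, coverage=coverage,
                prevalence=prevalence, specificity=specificity,
                accuracy=accuracy, lift=lift, psi=psi, chi2=chi2)

print(basic_measures(30, 10, 20, 40))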
Now we explain the motivation of this research in detail. Objective mea-
sures are useful for automatically removing obviously meaningless rules. However,
some factors of the evaluation criteria, such as generality and uniqueness,
contradict each other and may not match, or may even contradict, real human
interest. In a sense, it may seem improper to investigate the relation between
objective measures and real human interest, since their evaluation criteria do
not include knowledge of rule semantics and are obviously not the same as
real human interest. However, our idea is that they may be useful to support
KDD through human-system interaction if they possess a certain level of per-
formance in detecting really interesting rules. In addition, they may offer
a human user unexpected new viewpoints. Although the validity of objective
measures has been theoretically proven and/or experimentally discussed using
some benchmark data [5, 6, 7], very few attempts have been made to investigate
their comparative performance and their relation to real human
interest in a real application. Our investigation is novel in this light.
Table 3. The total rank of objective measures through first and second mining
Fig. 3. The evaluation results by a medical expert and objective measures for the
rules in first mining. Each column represents a rule, and each row represents the set
of evaluation results by an objective measure. The rules are sorted in the descending
order of the evaluation values given by the medical expert. The objective measures
are sorted in the descending order of the meta criterion values. A square in the left-
hand side surrounds the rules labeled with EI or I by the medical expert. White and
black cells mean that the evaluation by an objective measure was and was not the
same by the medical expert, respectively. The five columns in the right side show the
performance on the four comprehensive criteria and the meta one. ’+’ means the value
is greater than that of random selection
Fig. 4. The evaluation results by a medical expert and objective measures for the
rules in second mining. See the caption of Fig. 3 for the details
itative analysis, we visualized their degree of agreement to easily grasp its trend.
We colored the rules with agreement white and those with disagreement black.
The pattern of white and black cells for an objective measure describes how its
evaluation matched with those by the medical expert. The more the number
of white cells in the left-hand side, the better its performance to estimate real
human interest. For the quantitative analysis, we defined four comprehensive cri-
teria to evaluate the performance of an objective measure. #1: Performance on
I (the number of rules labeled with I by the objective measure over that by the
medical expert. Note that I includes EI). #2: Performance on EI (the number of
rules labeled with EI by the objective measure over that by the medical expert).
#3: Number-based performance on all evaluation (the number of rules with the
same evaluation results by the objective measure and the medical expert over
that of all rules). #4: Correlation-based performance on all evaluation (the cor-
relation coefficient between the evaluation results by the objective measure and
those by the medical expert). The values of these criteria are shown in the right
side of Fig. 3 and 4. The symbol '+' beside a value means that the value is
greater than that obtained when rules are randomly selected as EI or I; that is, an
objective measure with '+' performs at least better than random selection.
To know the total performance, we defined the weighted average of the
four criteria as a meta criterion; we assigned 0.4, 0.1, 0.4, and 0.1 to #1, #2,
#3, and #4, respectively, according to their importance. The objective measures
were sorted in the descending order of the values of meta criterion.
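The meta criterion itself is just the weighted average described above; a minimal sketch:

def meta_criterion(c1, c2, c3, c4):
    # Weighted average of the four comprehensive criteria:
    # #1 performance on I, #2 performance on EI,
    # #3 number-based performance, #4 correlation-based performance.
    return 0.4 * c1 + 0.1 * c2 + 0.4 * c3 + 0.1 * c4

# Illustrative values only (not taken from Fig. 3 or Fig. 4)
print(meta_criterion(0.7, 0.5, 0.73, 0.4))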
The results in the first mining in Fig. 3 show that Recall demonstrated the
highest performance, χ2 -M did the second highest, and J-M did the third high-
est. Prevalence demonstrated the lowest performance, Coverage did the second
lowest, and Credibility and GOI-G did the third lowest. The results in the second
mining in Fig. 4 show that Credibility demonstrated the highest performance,
Accuracy did the second highest, and Lift did the third highest. Prevalence
demonstrated the lowest performance, Specificity did the second lowest, and
Precision did the third lowest. To understand the whole trend, we calculated
the total rank of objective measures through the first and second mining and sum-
marized it in Table 3. We averaged the value of the meta criterion in the first mining
and that in the second mining for each objective measure and sorted the objective
measures by these averages. Although we carefully defined the meta criterion, it
is hard to say that it perfectly expresses the performance of ob-
jective measures. We therefore also did the same calculation on the ranks in the first and
second mining for a multidirectional discussion. As shown in Table 3, in either
case of averaged meta criterion or averaged rank, χ2 -M, Recall, and Accuracy
maintained their high performance through the first and second mining. On the
other hand, Prevalence and Specificity maintained their low performance. The
other objective measures slightly changed their middle performance.
Some objective measures – χ2 -M, Recall, and Accuracy – showed constantly
high performance; They had comparatively many white cells and ’+’ for all
comprehensive criteria. The results and the medical expert’s comments on them
imply that his interest consisted of not only the medical semantics but also
the statistical characteristics of rules. The patterns of white and black cells are
mosaic-like in Fig. 3 and 4. They imply that the objective measures have an almost
complementary relationship with each other and that the combined use of
objective measures may achieve higher performance. For example, the logical
addition of Recall and χ2-M in Fig. 3 increases the correct answer rate
#3 from 22/30 = 73.3% to 29/30 = 96.7%. We summarize the experimental
results as follows: some objective measures will work to a certain level in spite
of no consideration of domain semantics, and the combined use of objective
measures will work better. Although the experimental results are specific to
the experimental conditions, they suggest a possibility of utilizing objective
measures that has not been shown by the conventional theoretical studies on
objective measures.
learned with a conventional learning algorithm using the evaluation results by all
objective measures in the repository as inputs and those by the human user as
outputs. The simplest method is to formulate a function, y = f(x1, x2, ..., xn),
where y is an evaluation result by a human user, xi is that by the i-th objective mea-
sure, and n is the number of objective measures. We can realize it with linear
regression [37] by regarding the function as a summation of weighted values of
objective measures. We can also do so with a neural network by regarding the
function as a mapping among the evaluation results (a three-layered backprop-
agation neural network [36], which is the most popular one, should work well).
Another method is to organize a tree structure expressing a relation among the
evaluation results with C4.5 [3].
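A sketch of the linear-regression variant of this idea: each rule is described by the evaluation values of the n objective measures, and y = f(x1, ..., xn) is fitted to the human expert's evaluations; all data below are synthetic placeholders.

import numpy as np
from sklearn.linear_model import LinearRegression

# X[i, j]: evaluation of rule i by objective measure j (synthetic placeholder data)
rng = np.random.default_rng(1)
X = rng.random((30, 8))                          # 30 rules, 8 objective measures
w_true = rng.random(8)
y = X @ w_true + 0.1 * rng.standard_normal(30)   # stand-in for the expert's evaluations

model = LinearRegression().fit(X, y)             # y = f(x1, ..., xn) as a weighted sum
predicted = model.predict(X)                     # approximation of the expert's interest
print(model.coef_)                               # learned weight of each objective measure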
The conceivable bottleneck is that the modeling by an evaluation learning
algorithm may not succeed due to the inadequate quality and quantity of eval-
uation results by a human user. In particular, the problem of quantity may
be serious; the number of evaluation results by a human user may be too small
for machine learning. This kind of problem frequently appears in intelligent
systems that include human factors, such as interactive evolutionary computation
[38, 32]. However, our purpose in using an evaluation learning algorithm is not
highly precise modeling but rough, approximate modeling to support the
thinking process of a human user through human-system interaction. Therefore,
we judged that modeling by an evaluation learning algorithm is worth trying.
In our latest research, we actually built the models with C4.5, a neural net-
work, and linear regression. We briefly report their performance, estimated
using the rules and their evaluation results obtained in our past
research [1], which was introduced in Section 2.1. As shown in Table 4, their per-
formance in the first mining, which was the phase of hypothesis generation, was
considerably low due to the fluctuation of evaluation by the human user. On the
other hand, C4.5 and linear regression outperformed the objective measure with the
highest performance for #1 and #2 in the second mining, which was the phase
of hypothesis confirmation. These results indicate that although such modeling
will not work well in the phase of hypothesis generation, modeling by C4.5 and linear
regression as evaluation learning algorithms will work to a certain level in the
phase of hypothesis confirmation.
References
1. Ohsaki, M., Sato, Y., Yokoi, H., Yamaguchi, T.: A Rule Discovery Support Sys-
tem for Sequential Medical Data, – In the Case Study of a Chronic Hepatitis
Dataset –. Proceedings of International Workshop on Active Mining AM-2002 in
IEEE International Conference on Data Mining ICDM-2002 (2002) 97–102
2. Dempster, A. P., Laird, N. M., and Rubin, D. B.: Maximum Likelihood from In-
complete Data via the EM Algorithm. Journal of the Royal Statistical Society,
vol.39, (1977) 1–38.
3. Quinlan, J. R.: C4.5 – Program for Machine Learning –, Morgan Kaufmann (1993).
4. MacQueen, J. B.: Some Methods for Classification and Analysis of Multivariate
Observations. Proceedings of Berkeley Symposium on Mathematical Statistics and
Probability, vol.1 (1967) 281–297.
5. Yao, Y. Y., Zhong, N.: An Analysis of Quantitative Measures Associated with Rules.
Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining
PAKDD-1999 (1999) 479–488
6. Hilderman, R. J., Hamilton, H. J.: Knowledge Discovery and Measure of Interest.
Kluwer Academic Publishers (2001)
7. Tan, P. N., Kumar V., Srivastava, J.: Selecting the Right Interestingness Measure
for Association Patterns. Proceedings of International Conference on Knowledge
Discovery and Data Mining KDD-2002 (2002) 32–41
8. Hepatitis Dataset for Discovery Challenge. in Web Page of European Conference on
Principles and Practice of Knowledge Discovery in Databases PKDD-2002 (2002)
http://lisp.vse.cz/challenge/ecmlpkdd2002/
9. Hepatitis Dataset for Discovery Challenge. European Conf. on Principles and Prac-
tice of Knowledge Discovery in Databases (PKDD’03), Cavtat-Dubrovnik, Croatia
(2003) http://lisp.vse.cz/challenge/ecmlpkdd2003/
10. Motoda, H. (eds.): Active Mining, IOS Press, Amsterdam, Holland (2002).
11. Das, G., King-Ip, L., Heikki, M., Renganathan, G., Smyth, P.: Rule Discovery from
Time Series. Proceedings of International Conference on Knowledge Discovery and
Data Mining KDD-1998 (1998) 16–22
12. Lin, J., Keogh, E., Truppel, W.: (Not) Finding Rules in Time Series: A Surpris-
ing Result with Implications for Previous and Future Research. Proceedings of
International Conference on Artificial Intelligence IC-AI-2003 (2003) 55–61
13. Piatetsky-Shapiro, G.: Discovery, Analysis and Presentation of Strong Rules. in
Piatetsky-Shapiro, G., Frawley, W. J. (eds.): Knowledge Discovery in Databases.
AAAI/MIT Press (1991) 229–248
14. Smyth, P., Goodman, R. M.: Rule Induction using Information Theory. in
Piatetsky-Shapiro, G., Frawley, W. J. (eds.): Knowledge Discovery in Databases.
AAAI/MIT Press (1991) 159–176
15. Hamilton, H. J., Fudger, D. F.: Estimating DBLearn’s Potential for Knowledge
Discovery in Databases. Computational Intelligence, 11, 2 (1995) 280–296
16. Hamilton, H. J., Shan, N., Ziarko, W.: Machine Learning of Credible Classifi-
cations. Proceedings of Australian Conference on Artificial Intelligence AI-1997
(1997) 330–339
17. Dong, G., Li, J.: Interestingness of Discovered Association Rules in Terms of
Neighborhood-Based Unexpectedness. Proceedings of Pacific-Asia Conference on
Knowledge Discovery and Data Mining PAKDD-1998 (1998) 72–86
18. Gago, P., Bento, C.: A Metric for Selection of the Most Promising Rules. Pro-
ceedings of European Conference on the Principles of Data Mining and Knowledge
Discovery PKDD-1998 (1998) 19–27
19. Gray, B., Orlowska, M. E.: CCAIIA: Clustering Categorical Attributes into In-
teresting Association Rules. Proceedings of Pacific-Asia Conference on Knowledge
Discovery and Data Mining PAKDD-1998 (1998) 132–143
20. Morimoto, Y., Fukuda, T., Matsuzawa, H., Tokuyama, T., Yoda, K.: Algorithms for
Mining Association Rules for Binary Segmentations of Huge Categorical Databases.
Proceedings of International Conference on Very Large Databases VLDB-1998
(1998) 380–391
21. Freitas, A. A.: On Rule Interestingness Measures. Knowledge-Based Systems,
vol.12, no.5 and 6 (1999) 309–315
22. Liu, H., Lu, H., Feng, L., Hussain, F.: Efficient Search of Reliable Exceptions.
Proceedings of Pacific-Asia Conference on Knowledge Discovery and Data Mining
PAKDD-1999 (1999) 194–203
23. Jaroszewicz, S., Simovici, D. A.: A General Measure of Rule Interestingness. Pro-
ceedings of European Conference on Principles of Data Mining and Knowledge
Discovery PKDD-2001 (2001) 253–265
24. Zhong, N., Yao, Y. Y., Ohshima, M.: Peculiarity Oriented Multi-Database Mining.
IEEE Transaction on Knowledge and Data Engineering, 15, 4 (2003) 952–960
25. Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H., Verkamo, A. I.: Find-
ing Interesting Rules from Large Sets of Discovered Association Rules. Proceedings
of International Conference on Information and Knowledge Management CIKM-
1994 (1994) 401–407
26. Kamber, M., Shinghal, R.: Evaluating the Interestingness of Characteristic Rules.
Proceedings of International Conference on Knowledge Discovery and Data Mining
KDD-1996 (1996) 263–266
27. Liu, B., Hsu, W., Chen, S., Ma, Y.: Analyzing the Subjective Interestingness of
Association Rules. Intelligent Systems, 15, 5 (2000) 47–55
28. Liu, B., Hsu, W., Ma, Y.: Identifying Non-Actionable Association Rules. Pro-
ceedings of International Conference on Knowledge Discovery and Data Mining
KDD-2001 (2001) 329–334
29. Padmanabhan, B., Tuzhilin, A.: A Belief-Driven Method for Discovering Unex-
pected Patterns. Proceedings of International Conference on Knowledge Discovery
and Data Mining KDD-1998 (1998) 94–100
30. Sahara, S.: On Incorporating Subjective Interestingness into the Mining Process.
Proceedings of IEEE International Conference on Data Mining ICDM-2002 (2002)
681–684
31. Silberschatz, A., Tuzhilin, A.: On Subjective Measures of Interestingness in Knowl-
edge Discovery. Proceedings of International Conference on Knowledge Discovery
and Data Mining KDD-1995 (1995) 275–281
32. Terano, T., Inada, M.: Data Mining from Clinical Data using Interactive Evolu-
tionary Computation. in Ghosh, A., Tsutsui, S. (eds.): Advances in Evolutionary
Computing. Springer (2003) 847–862
33. Suzuki, E. and Shimura M.: Exceptional Knowledge Discovery in Databases Based
on an Information-Theoretic Approach. Journal of Japanese Society for Artificial
Intelligence, vol.12, no.2 (1997) pp.305–312 (in Japanese).
34. Hussain, F., Liu, H., and Lu, H.: Relative Measure for Mining Interesting Rules.
Proceedings of Workshop (Knowledge Management: Theory and Applications)
in European Conference on Principles and Practice of Knowledge Discovery in
Databases PKDD-2000 (2000).
35. Suzuki, E.: Mining Financial Data with Scheduled Discovery of Exception Rules.
Proceedings of Discovery Challenge in 4th European Conference on Principles and
Practice of Knowledge Discovery in Databases PKDD-2000 (2000).
36. Werbos, P. J.: The Roots of Backpropagation. Wiley-Interscience (1974/1994).
37. Fox, J.: Applied Regression Analysis, Linear Models, and Related Methods. Sage
Publications (1997).
38. Takagi, H.: Interactive Evolutionary Computation: Fusion of the Capacities of EC
Optimization and Human Evaluation. Proceedings of the IEEE, vol.89, no.9 (2001)
1275–1296.
39. H. Abe and T. Yamaguchi: Constructing Inductive Applications by Meta-Learning
with Method Repositories, Progress in Discovery Science, LNAI2281 (2002) 576–
585.
40. H. Abe and T. Yamaguchi: CAMLET,
http://panda.cs.inf.shizuoka.ac.jp/japanese/study/KDD/camlet/
Experimental Evaluation of
Time-Series Decision Tree
1 Introduction
Time-series data are employed in various domains including politics, economics,
science, industry, agriculture, and medicine. Classification of time-series data
is related to many promising application problems. For instance, an accurate
classifier for liver cirrhosis from time-series data of medical tests might replace
a biopsy, which samples liver tissue by inserting an instrument directly into the liver.
Such a classifier is highly important since it would substantially reduce the costs of
both patients and hospitals.
Conventional classification methods for time-series data can be classified into
a transformation approach and a direct approach. The former maps a time se-
quence to another representation. The latter, on the other hand, typically relies
on a dissimilarity measure between a pair of time sequences. They are further
divided into those which handle time sequences that exist in data [6] and those
which rely on abstracted patterns [3, 4, 7].
Comprehensibility of a classifier is highly important in various domains in-
cluding medicine. The direct approach, which explicitly handles real time se-
quences, has an advantage over other approaches. In our chronic hepatitis do-
main, we have found that physicians tend to prefer real time sequences to
abstracted time sequences, which can be meaningless. However, conventional
methods such as [6] rely on sampling, and problems related to the extensive com-
putation in such methods remain unknown.
In [15], we have proposed, for decision-tree induction, a split test which finds
the “best” time sequence that exists in data with exhaustive search. Our time-
series decision tree represents a novel classifier for time-series classification. Our
learning method for the time-series decision tree has enabled us to discover
a classifier which is highly appraised by domain experts [15]. In this paper,
we perform extensive experiments based on advice from domain experts, and
investigate various characteristics of our time-series decision tree.
1
We can show Euclidean distance instead of Manhattan distance. Experiments, how-
ever, showed that there is no clear winner in accuracy.
[Figure content: example records described by time-series attributes (GPT, ALB, PLT) and a class label (LC / non-LC); the panels of Fig. 2 contrast the correspondence obtained with the Manhattan distance and with the DTW-based measure.]
Fig. 2. Correspondence of a pair of time sequences in the Manhattan distance and the
DTW-based measure
with different numbers of values, but also fits human intuition. Figure 2 shows
examples of correspondence in which the Euclidean distance and the dissimilarity
measure based on DTW are employed. From the Figure, we see that the right-
hand side seems more natural than the other.
Now we define the DTW-based measure G(A, B) between a pair of time
sequences A = α1 , α2 , · · · , αI and B = β1 , β2 , · · · , βJ . The correspondence be-
tween A and B is called a warping path, and can be represented as a sequence
of grids F = f1 , f2 , . . . , fK on an I × J plane as shown in Figure 3.
[Fig. 3 content: a warping path F = f1, ..., fK from f1 = (1, 1) to fK = (I, J) on the I × J plane between sequences A and B, with each fk = (ik, jk) constrained to lie within the adjustment window.]
Let the distance between two values αik and βjk be d(fk) = |αik − βjk|;
then an evaluation function ∆(F) is given by ∆(F) = (1/(I + J)) Σk=1..K d(fk) wk.
The smaller the value of ∆(F) is, the more similar A and B are. In order to
prevent excessive distortion, we assume an adjustment window (|ik − jk | ≤ r),
and consider minimizing ∆(F) in terms of F, where wk is a positive weight for
fk , wk = (ik − ik−1 ) + (jk − jk−1 ), i0 = j0 = 0. The minimization can be
resolved without checking all possible F since dynamic programming, of which
complexity is O(IJ), can be employed. The minimum value of ∆(F) gives the
value of G(A, B).
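A sketch of this dynamic-programming computation, following the weights wk = (ik − ik−1) + (jk − jk−1) given above (1 for a horizontal or vertical step, 2 for a diagonal step); the function and variable names are ours.

def dtw_dissimilarity(a, b, r=None):
    # DTW-based measure G(A, B): d(f_k) = |alpha_ik - beta_jk|,
    # Delta(F) = (1/(I+J)) * sum_k d(f_k) * w_k, minimised by dynamic
    # programming under the adjustment window |i_k - j_k| <= r.
    I, J = len(a), len(b)
    r = r if r is not None else max(I, J)
    INF = float("inf")
    D = [[INF] * (J + 1) for _ in range(I + 1)]
    D[0][0] = 0.0
    for i in range(1, I + 1):
        for j in range(max(1, i - r), min(J, i + r) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = min(D[i - 1][j] + cost,          # w_k = 1
                          D[i][j - 1] + cost,          # w_k = 1
                          D[i - 1][j - 1] + 2 * cost)  # w_k = 2
    return D[I][J] / (I + J)

# Identical sequences have dissimilarity 0
print(dtw_dissimilarity([1, 2, 3, 4], [1, 2, 3, 4], r=2))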
of their corresponding time sequences to the time sequence. The use of a time
sequence which exists in data in its split node is expected to contribute to com-
prehensibility of the classifier, and each time sequence is obtained by exhaustive
search. The dissimilarity measure is based on DTW described in section 2.2.
We call this split test a standard-example split test. A standard-example split
test σ(e, a, θ) consists of a standard example e, an attribute a, and a threshold
θ. Let a value of an example e in terms of a time-series attribute a be e(a), then
a standard-example split test divides a set of examples e1 , e2 , · · · , en to a set
S1 (e, a, θ) of examples each of which ei (a) satisfies G(e(a), ei (a)) < θ and the
rest S2 (e, a, θ). We also call this split test a θ-guillotine cut.
As the goodness of a split test, we have selected gain ratio [12] since it is
frequently used in decision-tree induction. Since at most n − 1 split points are
inspected for an attribute in a θ-guillotine cut and we consider each example as a
candidate of a standard example, it frequently happens that several split points
exhibit the largest value of gain ratio. We assume that consideration on shapes
of time sequences is essential in comprehensibility of a classifier, thus, in such a
case, we define that the best split test exhibits the largest gap between the sets of
time sequences in the child nodes. The gap gap(e, a, θ) of σ(e, a, θ) is equivalent
to G(e′′(a), e(a)) − G(e′(a), e(a)), where e′ and e′′ represent the example ei(a)
in S1(e, a, θ) with the largest G(e(a), ei(a)) and the example ej(a) in S2(e, a, θ)
with the smallest G(e(a), ej(a)), respectively. When several split tests exhibit the
largest value of gain ratio, the split test with the largest gap(e, a, θ) among them
is selected.
Below we show the procedure standardExSplit which obtains the best
standard-example split test, where ω.gr and ω.gap represent the gain ratio and
the gap of a split test ω respectively.
Procedure: standardExSplit
Input: Set of examples e1 , e2 , · · · , en
Return value: Best split test ω
1 ω.gr = 0
2 Foreach(example e)
3 Foreach(time-series attribute a)
4 Sort examples e1, e2, · · · , en in the current node using G(e(a), ei(a)) as
a key, obtaining e′1, e′2, · · · , e′n
5 Foreach(θ-guillotine cut ω′ of e′1, e′2, · · · , e′n)
6 If ω′.gr > ω.gr
7 ω = ω′
8 Else If ω′.gr == ω.gr And ω′.gap > ω.gap
9 ω = ω′
10 Return ω
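A compact sketch of choosing the best θ-guillotine cut for one fixed standard example e and attribute a, assuming the dissimilarities G(e(a), ei(a)) have already been computed; the full standardExSplit loops this over all examples and attributes, and the names below are ours.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent, left, right):
    n = len(parent)
    gain = entropy(parent) - sum(len(s) / n * entropy(s) for s in (left, right))
    split_info = -sum(len(s) / n * math.log2(len(s) / n) for s in (left, right) if s)
    return gain / split_info if split_info > 0 else 0.0

def best_guillotine_cut(dists, labels):
    # dists[i] = G(e(a), e_i(a)); at most n - 1 split points are inspected,
    # and ties in gain ratio are broken by the largest gap between child sets.
    order = sorted(range(len(dists)), key=lambda i: dists[i])
    best = (-1.0, -1.0, None)                  # (gain ratio, gap, theta)
    for k in range(1, len(order)):
        lo, hi = dists[order[k - 1]], dists[order[k]]
        if lo == hi:
            continue
        theta = (lo + hi) / 2.0
        left = [labels[i] for i in order[:k]]   # S1: G < theta
        right = [labels[i] for i in order[k:]]  # S2: the rest
        gr = gain_ratio(labels, left, right)
        gap = hi - lo                           # G(e'', e) - G(e', e)
        if (gr, gap) > (best[0], best[1]):
            best = (gr, gap, theta)
    return best

print(best_guillotine_cut([0.1, 0.3, 0.8, 0.9], ["LC", "LC", "non-LC", "non-LC"]))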
We have also proposed a cluster-example split test σ′(e′, e′′, a) for compar-
ison. A cluster-example split test divides a set of examples e1, e2, · · · , en into
a set U1(e′, e′′, a) of examples each of whose ei(a) satisfies d(e′(a), ei(a)) <
d(e′′(a), ei(a)) and the rest U2(e′, e′′, a). The goodness of a split test is equivalent
to that of the standard-example split test without θ.
2
The original source is http://www.cse.unsw.edu.au/~waleed/tml/data/.
3
We admit that we can employ a different preprocessing method such as considering
regularity of the tests.
4
i.e. AST.
5
i.e. ALT.
6
For each time sequence, 50 points with an equivalent interval were generated by
linear interpolation between two adjacent points.
pruning method [11], and the adjustment window r of DTW was set to 10% of
the total length7.
For comparative purposes, we chose the methods presented in Section 1.
In order to investigate the effect of our exhaustive search and use of gaps, we
tested modified versions of SE-split which select each standard example from
1 or 10 randomly chosen examples without using gaps, denoted Random 1 and Random
10 respectively. Intuitively, the former corresponds to a decision-tree version of
[6]. In our implementation of [4], the maximum number of segments was set to
3 since it outperformed the cases of 5, 7, and 11. In our implementation of [7], we
used average, maximum, minimum, median, and mode as global features, and
Increase, Decrease, and Flat with 3 clusters and 10-equal-length discretization
as its "parametrised event primitives". Av-split and 1-NN represent the split
test on the average values of time sequences and the nearest neighbor method
with the DTW-based measure, respectively.
It is widely known that the performance of a nearest neighbor method largely
depends on its dissimilarity measure. As a result of trial and error, the following
dissimilarity measure H(ei, ej) was chosen since it often exhibits the highest
accuracy.
H(ei, ej) = Σk=1..m G(ei(ak), ej(ak)) / q(ak),
where q(ak) is the maximum value of the dissimilarity measure for a time-series
attribute ak, i.e. q(ak) ≥ G(ei(ak), ej(ak)) for all i and j.
In the standard-example split test, the value of a gap gap(e, a, θ) largely
depends on its attribute a. In the experiments, we also tested substituting
gap(e, a, θ)/q(a) for gap(e, a, θ). We omit the results in the following sections
since this normalization does not necessarily improve accuracy.
7
A value less than 1 is rounded down.
much faster8 . We have noticed that [4] might be vulnerable to outliers but this
should be confirmed by investigation. We attribute the poor performance of our
[7] to our choice of parameter values, and consider that we need to tune them
to obtain satisfactory results.
In terms of accuracy, Random 1, [7], and Av-split suffer in the EEG data
set and the latter two in the Sign data set. This would suggest effectiveness of
exhaustive search, small number of parameters, and explicit handling of time
sequences. 1-NN exhibits high accuracy in the Sign data set and relatively low
accuracy in the EEG data set. These results might come from the fact that all
attributes are relevant in the sign data set while many attributes are irrelevant
in the EEG data set.
CE-split, although it handles the structure of a time sequence explicitly, almost always exhibits lower accuracy than SE-split. We consider that this is because CE-split rarely produces pure child nodes9 , since it mainly divides a set of examples based on their shapes. We have observed that SE-split often produces nearly pure child nodes due to its use of the θ-guillotine cut.
8. We could not include the results of the latter for two data sets, since we estimate that they would take several months with our current implementation. Reuse of intermediate results, however, would significantly shorten this time.
9. A pure child node is a node whose examples all belong to the same class.
Experimental results show that DTW is sometimes time-inefficient. It should be noted that time is less important than accuracy in our problem. Recent advances such as [9], however, can significantly speed up DTW.
In terms of the size of a decision tree, SE-split, CE-split, and [4] consistently exhibit good performance. It should be noted that SE-split and CE-split show similar tendencies, though the latter often produces slightly larger trees, possibly due to the pure-node problem. A nearest neighbor method, being a lazy learner, has a deficiency in the comprehensibility of its learned results. This deficiency can be considered crucial in application domains such as medicine, where interpretation of learned results is highly important.
[Fig. 4 (learning curves): accuracy (%) versus ratio of training examples (0.1 to 0.9); the methods compared include the 1-nearest neighbor method]
The difference in time between Tables 1 and 2 mainly comes from the numbers of examples in the training set and the test set. The execution time of 1-NN, being a lazy learner, is equivalent to the time of the test phase and is roughly proportional to the number of test examples. As a result, it runs faster with leave-one-out than with 20 × 5-fold cross validation, which is the opposite of the tendency of a decision-tree learner.
The accuracy of SE-split often degrades substantially with 20 × 5-fold cross validation compared with leave-one-out. We attribute this to the fact that a “good” example, which is selected as a standard example if it is in a training set, belongs to the test set more frequently in 20 × 5-fold cross validation than in leave-one-out. In order to support this conjecture, we show the learning curve10 of each method for the EEG data, which has the largest number of examples among the four data sets, in Figure 4. In order to increase the number of examples, we counted each situation of a patient as an example. Hence an example is described by 64 attributes, and we randomly picked 250 examples from each class. We omitted several methods due to their performance in Tables 1 and 2. The Figure shows that the accuracy of the decision-tree learner with the standard-example split test degrades heavily when the number of examples is small. These experiments show that our standard-example split test is appropriate for a data set with a relatively large number of examples.
10. Each accuracy represents an average of 20 trials.
[Figure: a three-level time-series tree splitting on CHE (patient 278), GPT (patient 404), and ALB (patient 281); leaf annotations include LC patients = 13 and LC patients = 5]
Fig. 5. Time-series tree learned from H0 (chronic hepatitis data of the first biopsies)
The classifier obtained below might realize such a story. We prepared another data set, which we call H0, from the chronic hepatitis data by dealing with the first biopsies only. H0 consists of 51 examples (21 LC patients and 30 non-LC patients), each of which is described with 14 attributes. Since a patient who underwent a biopsy is typically subject to various treatments such as interferon, using the first biopsies only enables us to analyze a more natural stage of the disease than in H1 and H2. We show the decision tree learned from H0 with the standard-example split test in Figure 5. In the Figure, the number in parentheses after an attribute represents a patient ID, and a leaf node predicts its majority class. A horizontal dashed line in a graph represents the border value between two categories (e.g. normal and high) of the corresponding medical test.
The decision tree and the time sequences employed in the Figure attracted the interest of physicians and were recognized as an important discovery. The time-series tree investigates the potential capacity of the liver by CHE and predicts patients with low capacity as liver cirrhosis. Then the tree investigates the degree of inflammation for the other patients by GPT, and predicts patients with heavy inflammation as liver cirrhosis. For the remaining patients, the tree investigates another sort of potential capacity of the liver by ALB and predicts liver cirrhosis based on this capacity. This procedure agrees closely with physicians' routine interpretation of blood tests. Our proposed method was highly appraised by them since it discovered results which are highly consistent with physicians' knowledge while using only medical knowledge about relevant attributes. They consider that we need to verify the plausibility of this classifier with as much information as possible from various sources, and then eventually move to biological tests.
During the quest, there was an interesting debate among the authors. Yamada and Suzuki, as machine learning researchers, proposed abstracting the time sequences in Figure 5. Yokoi and Takabayashi, who are physicians, however, insisted on using time sequences that exist in the data set11 . The physicians are afraid that such abstracted time sequences are meaningless, and claim that the use of real time sequences is appropriate from the medical point of view.
We show the results of the learning methods with H0 in Table 3. Our time-series tree outperforms the other methods in accuracy12 , and in tree size except for [7]. A significance test based on the two-tailed t-distribution shows that the differences in tree size are statistically significant. We can safely conclude that our method outperforms the other methods when both accuracy and tree size are considered. Moreover, inspection of the patients mis-predicted in leave-one-out revealed that most of them can be considered exceptions. This shows that our method is also effective in detecting exceptional patients.
11. Other physicians supported Yokoi and Takabayashi. They are nonetheless willing to see the results of abstracted patterns as well.
12. By a binary test with correspondence (a paired test), however, we cannot reject, for example, the hypothesis that SE-split and Av-split do not differ in accuracy. We need more examples to show the superiority of our approach in terms of accuracy.
We obtained the following comments from medical experts who are not authors of our articles.
– The proposed learning method exhibits novelty and is highly interesting. The splits in the upper parts of the time-series decision trees are valid, and the learning results are surprisingly good for a method which employs domain knowledge about attributes only.
– Medical test values measured after a biopsy are typically influenced by treatments such as interferon (IFN). It would be better to use only medical test values which were measured before a biopsy.
– A measurement period of 1,000 days is long, since the number n of patients is small. It would be better to use shorter periods such as 365 days.
– The number of medical tests could possibly be reduced to 4 per year. Prediction from a smaller number of medical tests has a higher impact on clinical treatment13 .
– Medical experts are familiar with sensitivity, specificity, and ROC curves as evaluation indices of a classifier. Overlooking an LC patient causes more problems than misclassifying a non-LC patient.
                       LC (actual)   non-LC (actual)
LC (prediction)        TP            FP
non-LC (prediction)    FN            TN

Sensitivity (True Positive Rate) = \frac{TP}{TP + FN}   (2)

Specificity (True Negative Rate) = \frac{TN}{TN + FP}   (3)
where C represents a user-specified weight. We set C = 5 throughout the experiments, and employed a leave-one-out method. Note that Cost is normalized in order to facilitate comparison of experimental results from different data sets.
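Eqs. (2) and (3) follow directly from the confusion matrix. The sketch below also includes one plausible normalized, weighted misclassification cost with false negatives weighted by C; the exact Cost formula is not reproduced in this excerpt, so that function is an assumption, not the authors' definition.

def sensitivity(tp, fn):
    # True positive rate, Eq. (2).
    return tp / (tp + fn)

def specificity(tn, fp):
    # True negative rate, Eq. (3).
    return tn / (tn + fp)

def weighted_cost(tp, fp, fn, tn, C=5):
    # Hypothetical normalization: weighted errors divided by the worst attainable weighted error.
    return (C * fn + fp) / (C * (tp + fn) + (fp + tn))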
It is reported that Laplace correction is effective in decision tree induction
for cost-sensitive classification [2]. We obtained the probability Pr(a) of a class
a when there are ν(a) examples of a among ν examples as follows.
Pr(a) = \frac{\nu(a) + l}{\nu + 2l}   (4)
where l represents the parameter of the Laplace correction. We set l = 1 unless stated otherwise.
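Eq. (4) in code, as a small sketch:

def laplace_probability(nu_a, nu, l=1):
    # Pr(a) = (nu(a) + l) / (nu + 2l) for a two-class problem; l = 0 disables the correction.
    return (nu_a + l) / (nu + 2 * l)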
We modified data selection criteria in each series of experiments and prepared
various data sets as shown in Table 5. In a name of a data set, the first figure
represents the number of days of the selected period of measurement before a
biopsy, the figure subsequent to a “p” represents the number of required medical
tests, and the figure subsequent to an “i” represents the number of days of an
interval in interpolation. Since we employed both B-type and C-type patients in all experiments, each data set name contains the string “BC”. Since we had obtained new biopsy data after [15], we employed an integrated version in the experiments.
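To illustrate the naming scheme only (the function and the regular expression are ours, not part of the paper), a name such as 180BCp6i5 can be decomposed as follows:

import re

def parse_dataset_name(name):
    # Split '<period><types>p<tests>i<interval>' into its four components.
    m = re.fullmatch(r"(\d+)([A-Z]+)p(\d+)i(\d+)", name)
    if m is None:
        raise ValueError("unrecognized data set name: " + name)
    period, types, tests, interval = m.groups()
    return {"period_days": int(period), "types": types,
            "required_tests": int(tests), "interval_days": int(interval)}

# parse_dataset_name("180BCp6i5")
# -> {'period_days': 180, 'types': 'BC', 'required_tests': 6, 'interval_days': 5}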
Table 6. Results of experiments for test numbers, where data sets p6, p3, and p2
represent 180BCp6i5, 180BCp3i5, and 180BCp2i5 respectively
Table 7. Results for accuracy, size, and cost of experiments for periods, where data sets
90, 180, 270, and 360 represent 90BCp3i5, 180BCp6i5, 270BCp9i5, and 360BCp12i5
respectively
Table 8. Results for sensitivity and specificity of experiments for periods

             sensitivity                 specificity
method    90    180   270   360       90    180   270   360
Combined  0.50  0.52  0.33  0.23      0.87  0.87  0.77  0.61
Average   0.61  0.39  0.40  0.54      0.86  0.99  0.95  0.83
Line      0.39  0.57  0.47  0.38      0.89  0.94  0.85  0.56
Table 9. Results for accuracy, size, and cost of experiments for intervals, where data
sets i2, i4, i6, i8, and i10 represent 180BCp6i2, 180BCp6i4, 180BCp6i6, 180BCp6i8,
and 180BCp6i10 respectively
Table 6. From the Table, we see that the average-split test and the line-split test outperform the other methods in cost for p2 and p6, respectively. For p3, the methods exhibit the same cost and outperform our standard-example split test. We believe that the poor performance of our method is due to lack of information on the shapes of time sequences and lack of examples. We interpret the results as indicating that lack of the former information in p2 favors the average-split test, while lack of the latter information in p6 favors the line-split test. If simplicity
Table 10. Results for sensitivity and specificity of experiments for intervals
sensitivity specificity
method i2 i4 i6 i8 i10 i2 i4 i6 i8 i10
Combined 0.57 0.52 0.52 0.52 0.52 0.96 0.97 0.93 0.91 0.93
Average 0.43 0.43 0.39 0.43 0.39 0.99 0.99 0.99 0.99 0.97
Line 0.57 0.52 0.52 0.57 0.57 0.96 0.94 0.94 0.94 0.87
Table 11. Results of experiments for Laplace correction values with 180BCp6i6, where
methods C, A, and L represent Combined, Average, and Line respectively
of a classifier is also considered, the decision tree learned with the average-split
test from p2 would be judged as the best.
Secondly, we modified the selected period to 90, 180, 270, and 360 days, with an interpolation interval of 5 days and one required medical test per 30 days. We show the results in Tables 7 and 8. From Table 7, we see that the average-split test and the line-split test almost always outperform our standard-example split test in cost, though there is no clear winner between them. We again attribute this to lack of information on the shapes of time sequences and lack of examples. Our standard-example split test performs relatively well for 90 and 180, and this would be due to their relatively large numbers of examples. If simplicity of a classifier is also considered, the decision tree learned with the line-split test from 180 would be judged as the best.
Thirdly, we modified the interpolation interval to 2, 4, · · · , 10 days with a 180-day period and 6 required medical tests. We show the results in Tables 9 and 10. From Table 9, we see that our standard-example split test and the line-split test outperform the average-split test in cost, though there is no clear winner between them. Since 180 in Tables 7 and 8 represents 180BCp6i5, it would correspond to i5 in this Table. Our relatively poor cost of 0.35 for i5 suggests that our method exhibits good performance for small and large intervals, and this fact requires further investigation. If simplicity of a classifier is also considered, the line-split test is judged as the best, and we again attribute this to lack of information for our method.
Lastly, we modified the Laplace correction parameter l to 0, 1, · · · , 5 with a 180-day period, 6 required medical tests, and a 6-day interpolation interval. We show the results in Table 11. From the Table, we see that, contrary to our expectation, the Laplace correction increases cost for our standard-example split test and the line-split test. Even for the average-split test, the case without the Laplace correction (l = 0) rivals the best case with the Laplace correction (l = 1). The Table suggests that this comes from the fact that the Laplace correction lowers sensitivity, but this requires further investigation.
4.2 Experiments
We have compared our standard-example split test, the cluster-example split test, the average-split test, a method by Geurts [4], and a method by Kadous [7]. We set Nmax = 5 in the method of Geurts, and the number of discretization
14. It should be noted that the obtained accuracy can be counter-intuitive in this kind of experiment.
[Figure: time-series decision tree (Cost = 0.18) splitting on CHE, T-CHO (gradient), and I-BIL; leaf class counts include LC=15, LC=8, LC=6, non-LC=34, and LC=1]
bins to 5 and the number of clusters to 5 in the method of Kadous. Experiments were performed with the leave-one-out method and without the Laplace correction. We show the results in Table 13, and the decision trees learned from all data with the standard-example split test, the cluster-example split test, and the average-split test in Figures 6, 7, and 8, respectively. The conditions were chosen so that each of them exhibits the lowest cost for the corresponding method.
[Figure: time-series decision tree (Cost = 0.20) with splits based on PLT (patients 37 and 275), TP (patients 37 and 206), GPT (patients 913 and 925), and T-BIL (patients 278 and 758); leaf class counts include LC=21, non-LC=24, LC=7, non-LC=9, LC=2, and non-LC=1]
[Figure: decision tree learned with the average-split test, splitting on T-BIL (threshold 0.96) and GOT (threshold 111.07); leaf class counts include LC=15, non-LC=32, LC=4, LC=3, non-LC=1, and LC=8]
From the Table, we see that our standard-example split test performs better
with gain ratio, and the cluster-example split test and the average-split test
perform better with gain. We think that the former is due to affinity of gain
ratio, which tends to select an unbalanced split, to our standard-example split
test, which splits examples based on their similarities or dissimilarities to its
standard example. Similarly, we think that the latter is due to the affinity of gain, which tends to select a balanced split, to the cluster-example split test and the average-split test.
5 Conclusions
For our time-series decision tree, we have investigated the case in which medical
tests before a biopsy are neglected and the case in which goodness of a split
test is altered. In the former case, our time-series decision tree is outperformed
by simpler decision trees in misclassification cost due to lack of information on
sequences and examples. In the latter case, our standard-example split test per-
forms better with gain ratio, and the cluster-example split test and the average-
split test perform better with gain probably due to affinities in each combination.
We plan to extend our approach as both a cost-sensitive learner and a discovery
method.
In applying a machine learning algorithm to real data, one often encounters
the problems of quantity, quality, and form. Data are massive (quantity), noisy
(quality), and in various shapes (form). Recently, the third problem has moti-
vated several researchers to propose classification from structured data including
time-series, and we believe that this tendency will continue in the near future.
Information technology in medicine has broadened its target in the last decade from string and numerical data to multimedia data such as time-series and image data, and aims to support the whole process of medicine, which handles various types of data [14]. Our decision-tree induction method, which handles the shape of a time sequence explicitly, has succeeded in discovering results which were highly appraised by physicians. We anticipate that there is still a long way to go toward an effective classification method which handles various structures of multimedia data explicitly, but our method can be regarded as an important step toward this objective.
Acknowledgement
This work was partially supported by the grant-in-aid for scientific research on
priority area “Active Mining” from the Japanese Ministry of Education, Culture,
Sports, Science and Technology.
References
1. P. Berka. ECML/PKDD 2002 discovery challenge, download data about hepatitis. http://lisp.vse.cz/challenge/ecmlpkdd2002/, 2002. (current September 28th, 2002).
1 Introduction
Multi-aspect mining in a multi-phase KDD (Knowledge Discovery and Data Min-
ing) process is an important methodology for knowledge discovery from real-life
data [5, 22, 26, 27]. There are two main reasons why a multi-aspect mining ap-
proach needs to be used for hepatitis data analysis.
The first reason is that we cannot expect to develop a single data mining algorithm for analyzing all main aspects of the hepatitis data towards a holistic view, because of the complexity of real-world applications. Hence, various data mining agents need to be used cooperatively in the multi-phase data mining process for performing multi-aspect analysis as well as multi-level conceptual abstraction and learning.
The other reason is that when performing multi-aspect analysis for complex problems such as hepatitis data mining, a data mining task needs to be decomposed into sub-tasks, so that these sub-tasks can be solved by using one or more data mining agents distributed over different computers. The decomposition problem thus leads us to the problem of distributed cooperative system design.
More specifically, when planning therapy using IFN (interferon) medication for chronic hepatitis patients, various kinds of conceptual knowledge/rules are beneficial for giving a
[Fig. 1: the general model of multi-aspect mining and meta-learning, showing a Data Base, data mining agents, and a Meta-Learning step]
treatment. The knowledge/rules, for instance, include (1) when IFN should be used for a patient so that he/she can be cured, (2) what kinds of inspections are important for a diagnosis, and (3) whether some peculiar data/patterns exist or not.
The paper describes our work on cooperatively using various data mining
agents including the GDT-RS inductive learning system for discovering deci-
sion rules [23, 28], the LOI (learning with ordered information) for discovering
ordering rules and important features [15, 29], as well as the POM (peculiarity
oriented mining) for finding peculiarity data/rules [30], for multi-aspect analy-
sis of the hepatitis data so that such rules mentioned above can be discovered
automatically. Furthermore, by meta learning, the rules discovered by LOI and
POM can be used to improve the quality of decision rules discovered by GDT-
RS. Figure 1 gives a general model of our methodology for multi-aspect mining
and meta learning.
We emphasize that both the pre-processing and post-processing steps are important before/after using data mining agents. In particular, informed knowledge discovery in general uses background knowledge obtained from experts (e.g. medical doctors) about a domain (e.g. chronic hepatitis) to guide a multi-phase spiral discovery process (pre-processing, rule mining, and post-processing) towards finding interesting and novel rules/features hidden in data. Background knowledge may take several forms, including rules already found, taxonomic relationships, causal preconditions, ordered information, and semantic categories.
In our experiments, the results of the blood tests of the patients who received laboratory examinations before starting the IFN treatment are first pre-processed. After that, the pre-processed data are used by each data mining agent. By using GDT-RS, decision rules concerning whether a medical treatment is effective or not can be found. By using LOI, we can investigate which attributes greatly affect the medical treatment of hepatitis C. Furthermore, peculiar data/patterns with a positive/negative meaning can be detected by using POM for finding interesting rules and/or for data cleaning. Our methodology and experimental results show that the perspective of medical doctors will be changed from a single type of experimental data analysis towards a holistic view by using our multi-aspect mining approach.
The rest of the paper is organized as follows. Section 2 describes how to pre-
process the hepatitis data and decide the threshold values for condition attributes
according to the background knowledge obtained from medical doctors. Section 3
gives an introduction to GDT-RS and discusses main results mined by using the
GDT-RS and post-processing, which are based on a medical doctor’s advice and
comments. Then in Sections 4 and 5, we extend our system by adding the LOI
(learning with ordered information) and POM (peculiarity oriented mining) data
mining agents, respectively, for multi-aspect mining and meta learning. Finally,
Section 6 gives concluding remarks.
2 Pre-processing
2.1 Selection of Inspection Data and Class Determination
We use the following conditions to extract inspection data.
Thus, 197 patients with 11 condition attributes as shown in Table 1 are selected
and will be used in our data mining agents.
Furthermore, the decision attribute (i.e. the classes) is selected according to the result of judging the IFN effect by whether a hepatitis virus exists or not. Hence, the 197 extracted patients can be classified into the 3 classes shown in Table 2.
1. All the inspection values in one year before IFN is used for each patient are
divided into two groups, the first half and the second half of the inspection
values.
[Figure: the inspection values within one year before IFN are divided along the time axis into the first half and the second half]
2. When the absolute value of the difference between the average values of the first half and the second half of the inspection values exceeds the threshold, the attribute is estimated as up or down. Otherwise, it is estimated as “–” (i.e. no change). Moreover, it is estimated as “?” when there is no inspection data or only one inspection (i.e. the patient was examined only once).
Furthermore, the threshold values are decided as follows.
– The threshold values for each attribute except GPT are set to 10% of the normal range of the corresponding inspection data. As the change of a hepatitis patient's GPT value greatly exceeds the normal range, the threshold value for GPT needs to be calculated with a more elaborate method, described below. The threshold values used for evaluating each condition attribute are shown in Table 3.
– The threshold value for GPT is calculated as follows. As shown in Fig. 3, the standard deviation of the differences between adjacent test values of each hepatitis patient's GPT is first calculated; then the standard deviation of these standard deviations is used as the threshold value.
Table 3. Threshold values for each condition attribute (inspection value):
T-CHO > 9.5, CHE > 25, ALB > 0.12, TP > 0.17, T-BIL > 0.1, D-BIL > 0.03, I-BIL > 0.07, PLT > 20, WBC > 0.5, HGB > 0.6, GPT > 54.56
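A sketch of the estimation procedure described above, under our own reading of it (the helper names are hypothetical): the values within one year before IFN are split into halves, the difference of their averages is compared with the attribute's threshold, and the GPT threshold is derived from the standard deviation, over patients, of each patient's standard deviation of adjacent differences.

import statistics

def estimate_change(values, threshold):
    # Return 'up', 'down', '-' (no change), or '?' (no data or only one inspection).
    if len(values) < 2:
        return "?"
    half = len(values) // 2
    diff = statistics.mean(values[half:]) - statistics.mean(values[:half])
    if abs(diff) <= threshold:
        return "-"
    return "up" if diff > 0 else "down"

def gpt_threshold(gpt_series_per_patient):
    # Standard deviation of the per-patient standard deviations of adjacent GPT differences.
    per_patient_sd = [statistics.pstdev([b - a for a, b in zip(s, s[1:])])
                      for s in gpt_series_per_patient if len(s) >= 2]
    return statistics.pstdev(per_patient_sd)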
3 Mining by GDT-RS
3.1 GDT-RS
GDT-RS is a soft hybrid induction system for discovering decision rules from
databases with uncertain and incomplete data [23, 28]. The system is based on
a hybridization of the Generalization Distribution Table (GDT) and the Rough
Set methodology. We recall some principles of GDT-RS to understand this ap-
plication.
The main features of GDT-RS are the following:
– Biases for search control can be selected in a flexible way. Background knowl-
edge can be used as a bias to control the initiation of GDT and in the rule
discovery process.
– The rule discovery process is oriented toward inducing rules with high qual-
ity of classification of unseen instances. The rule uncertainty, including the
ability to predict unseen instances, can be explicitly represented by the rule
strength.
– A minimal set of rules with the minimal (semi-minimal) description length, having large strength, and covering all instances can be generated.
– Interesting rules can be induced by selecting a discovery target and class
transformation.
where PG_i[l] is the value of the l-th attribute in the possible generalization PG_i and n_k is the number of values of the k-th attribute. Certainly we have \sum_j G_{ij} = 1 for any i.
Assuming E = \prod_{k=1}^{m} n_k, the equation Eq. (2) can be rewritten in the following form:

G_{ij} = p(PI_j \mid PG_i) = \begin{cases} \dfrac{\prod_{k \in \{l \,\mid\, PG_i[l] \neq *\}} n_k}{E} & \text{if } PI_j \in PG_i \\ 0 & \text{otherwise.} \end{cases}   (4)
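A minimal sketch of Eq. (4), assuming a generalization is a tuple whose wildcard positions are None, an instance is a tuple of concrete values, and n_values[k] stands for n_k:

from math import prod

def gdt_probability(instance, generalization, n_values):
    # p(PI_j | PG_i): zero unless the instance matches the generalization; otherwise the
    # product of n_k over non-wildcard positions divided by E = product of all n_k,
    # which equals one over the number of possible instances covered by the generalization.
    matches = all(g is None or g == v for g, v in zip(generalization, instance))
    if not matches:
        return 0.0
    E = prod(n_values)
    numerator = prod(n for n, g in zip(n_values, generalization) if g is not None)
    return numerator / E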
Furthermore, rule discovery can be constrained by three types of biases cor-
responding to three components of the GDT so that a user can select more
1. For simplicity, the wild card will sometimes be omitted in the paper.
general concept descriptions from an upper level or more specific ones from a
lower level, adjust the strength of the relationship between instances and their
generalizations, and define/select possible instances.
Rule Strength. Let us recall some basic notions for rule discovery from
databases represented by decision tables [10]. A decision table (DT) is a tu-
ple T = (U, A, C, D), where U is a nonempty finite set of objects called the
universe, A is a nonempty finite set of primitive attributes, and C, D ⊆ A are
two subsets of attributes that are called condition and decision attributes, re-
spectively [14, 16]. By IND(B) we denote the indiscernibility relation defined by B ⊆ A, [x]_{IND(B)} denotes the indiscernibility (equivalence) class defined by x, and U/B the set of all indiscernibility classes of IND(B). A descriptor over B ⊆ A is any pair (a, v) where a ∈ A and v is a value of a. If P is a conjunction of some descriptors over B ⊆ A, then by [P]_B (or [P]) we denote the set of all objects in DT satisfying P.
In our approach, the rules are expressed in the following form:
P → Q with S
i.e., “if P then Q with the strength S” where P denotes a conjunction of de-
scriptors over C (with non-empty set [P ]DT ), Q denotes a concept that the rule
describes, and S is a “measure of strength” of the rule defined by
S(P → Q) = s(P ) × (1 − r(P → Q)) (5)
where s(P ) is the strength of the generalization P (i.e., the condition of the
rule) and r is the noise rate function. The strength of a given rule reflects the
incompleteness and uncertainty in the process of rule inducing influenced both
by unseen instances and noise.
The strength of the generalization P = PG is given by Eq. (6) under the assumption that the prior distribution is uniform:

s(P) = \sum_{l} p(PI_l \mid P) = card([P]_{DT}) \times \frac{1}{N_P}   (6)
where card([P ]DT ) is the number of observed instances satisfying the general-
ization P .
The strength of the generalization P represents explicitly the prediction for
unseen instances. On the other hand, the noise rate is given by Eq. (7):

r(P \to Q) = 1 - \frac{card([P]_{DT} \cap [Q]_{DT})}{card([P]_{DT})}.   (7)
It shows the quality of classification measured by the number of instances satisfying the generalization P which cannot be classified into class Q. The user can specify an allowed noise level as a threshold value; rule candidates with a noise level larger than the given threshold will then be deleted.
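Putting Eqs. (5)-(7) together, a rough sketch (the set-based representation and the names are ours):

def rule_strength(covered_by_P, in_class_Q, N_P):
    # S(P -> Q) = s(P) * (1 - r(P -> Q)), with s(P) = |[P]| / N_P   (Eq. 6)
    # and r(P -> Q) = 1 - |[P] intersect [Q]| / |[P]|                (Eq. 7),
    # where covered_by_P and in_class_Q are sets of object identifiers.
    card_P = len(covered_by_P)
    if card_P == 0:
        return 0.0
    s = card_P / N_P
    r = 1 - len(covered_by_P & in_class_Q) / card_P
    return s * (1 - r)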
One can observe that the rule strength we are proposing is equal to its con-
fidence [1] modified by the strength of the generalization appearing on the left
hand side of the rule. The reader can find in literature other criteria for rule
strength estimation (see e.g., [3, 7, 13]).
Evaluation of Rules. From the viewpoint of the rules with higher support (e.g. rule-001 and rule-101), we observed that
– a patient heals up in many cases if he/she is medicated with IFN at the time when GPT is going up (hepatitis is getting worse);
– a patient does not heal up in many cases even if he/she is medicated with IFN at the time when D-BIL is descending.
Furthermore, the following two points on the effect of IFN are clearly understood.
– It is relevant to different types of hepatitis viruses;
– It is hard for it to be effective when there is a large amount of hepatitis virus.
Hence, we can see that rule-001 and rule-101 do not conflict with the existing
medicine knowledge.
From the two rules, the hypothesis:
“IFN is more effective when the inflammation of hepatitis is stronger”
can be formed. Based on this hypothesis, we can evaluate the rules discovered
as follows.
– In class R, the acceptability of the rules with respect to aggravation of the
liver function will be good.
– In class N, the acceptability of the rules with respect to recovery of the liver
function will be good.
– Finding out the rules which are significant from the statistical point of view,
based on rough categorizing such as recovery and aggravation.
– Showing whether such rough categorizing is sufficient or not.
I_a((x, y)) = \begin{cases} d(x, y) & x \succ_{\{a\}} y, \\ 0 & x \sim y, \\ -d(x, y) & x \prec_{\{a\}} y. \end{cases}   (13)

I_a((x, y)) = \begin{cases} y\succ & x \succ_{\{a\}} y, \\ {=} & x \sim y, \\ y\prec & x \prec_{\{a\}} y. \end{cases}   (14)

I_a((x, y)) = \begin{cases} x \succ y & x \succ_{\{a\}} y, \\ x = y & x \sim y, \\ x \prec y & x \prec_{\{a\}} y. \end{cases}   (15)
By using the ordered information table, the issue of mining ordering rules can be regarded as a kind of classification. Hence, our GDT-RS rule mining system can be used to generate ordering rules from an ordered information table. The ordering rules can be used to predict the effect of IFN. Moreover, the most important attributes can be identified when a rule has too many condition attributes.
In general, a large number of ordering rules may be generated. In order to give a ranking of which attributes are important, we use the following equation to analyse the frequency of each attribute in the generated ordering rules:
f = \frac{ca}{tc}   (16)
where f is the frequency of the attribute, ca is the coverage of the attribute, and tc is the total coverage.
4.1 Experiment 1
In this experiment, we used the following 12 attributes as condition attributes:
T-CHO, CHE, ALB, TP, T-BIL, D-BIL, I-BIL, PLT, WBC, HGB, GPT, GOT
and used the same decision class as that used in GDT-RS (see Table 2). After removing the patients who have no data on whether the hepatitis virus exists or not, the data of 142 patients were used.
For hepatitis C, it is possible to provide ordered relations for the condition attributes and the decision class, respectively. First, the class (?) is removed from the dataset, since we only consider the classes R and N for mining by LOI. Then we use background knowledge for forming data granules by discretization of real-valued attributes as shown in Table 12, and LOI is carried out on the transformed information table. Furthermore, we need to use ordered information between attribute values for making the ordered information table. In general, there exists background knowledge such as “if T-CHO lowers, then hepatitis generally gets worse” for each attribute. Hence we can define ordered information as shown in Table 13. From this table, we can also see that ordered
information is directly given for the real-valued attributes, D-BIL, I-BIL, and
HGB, without discretization, since no specific background knowledge is used for
the discretization in this experiment.
The ordered information table is generated by using the standard-based method (i.e. Eq. (14)), as shown in Fig. 4. The ordered information table may be viewed as an information table with added semantics (background knowledge). After such a transformation, ordering rules can be discovered from the ordered information table by using the GDT-RS rule mining system.
Results and Evaluation. The rules discovered by LOI are shown in Table 14, where the condition attributes in the rules denote the inspection value “go better” or “go worse”.
The evaluation of acceptability and novelty for the rules heavily depends
on the correctness of the ordered information. By investigating the values of
each attribute in the ordered information table by using Eqs. (17) and (18), the
correction rate of the background knowledge with respect to “go better” (or “go
worse”) can be obtained as shown in Table 15.
att_{pos} = \frac{\#ATT_{\succ,\succ}}{\#ATT_{\succ,\succ} + \#ATT_{\succ,\prec}}   (17)

att_{neg} = \frac{\#ATT_{\prec,\prec}}{\#ATT_{\prec,\succ} + \#ATT_{\prec,\prec}}   (18)

where #ATT is the number of different attribute values of attribute ATT in the ordered information table; ⟨≻, ≻⟩ denotes that the attribute value is “go better” and the patient is cured; ⟨≻, ≺⟩ denotes that the attribute value is “go better” but the patient is not cured; ⟨≺, ≺⟩ denotes that the attribute value is “go worse” and the patient is not cured; and ⟨≺, ≻⟩ denotes that the attribute value is “go worse” but the patient is cured.
The higher correction rates of the background knowledge (i.e. for TP and HGB) can be explained by the background knowledge being consistent with the specific characteristics of the real collected data. On the contrary, the lower correction rate (i.e. for WBC) may mean that the order relation given by an expert may not be suitable for the specific data analysis. In this case, the order relation as common background knowledge needs to be adjusted according to specific characteristics of the real data, such as the distribution and clusters of the real data. How to adjust the order relation is an important ongoing work.
4.2 Experiment 2
Although the data are the same as those used in Experiment 1, the ordered relations used in Experiment 2 (see Table 16) are different. Furthermore, the ordered information table is created as illustrated below.
[Figure: creating the ordered information table. Step 2: create the ordered information table by using the standard-based method; the resulting table has columns Object, ALB, CHE, . . . , GOT, CLASS with rows such as (p1, p2), (p1, p3), (p2, p1), derived from the original information table with attributes A1, A2, . . . , Am and values x_{ij}]
rule-ID rule
p-1 CHE(329.5)
p-2 HGB(17.24)
p-3 T-BIL(1.586) ∧ I-BIL(0.943) ∧ D-BIL(0.643)
p-4 CHE(196) ∧ PLT(73.667)
p-5 PLT(176.5) ∧ T-CHO(117.75)
p-6 ALB(4.733) ∧ I-BIL(0.95)
p-7 TP(8.46) ∧ GOT(175.8) ∧ GPT(382.2)
p-8 ALB(3.9) ∧ T-CHO(120.667) ∧ TP(6.65) ∧ WBC(2.783)
PF(x_{ij}) = \sum_{k=1}^{n} N(x_{ij}, x_{kj})^{\alpha}   (19)
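Eq. (19) in a small sketch; the distance N between attribute values is problem-dependent, so a plain absolute difference is used here only as a stand-in, and alpha = 0.5 is a commonly used choice rather than a value taken from this excerpt.

def peculiarity_factor(x_ij, column_values, alpha=0.5, N=lambda a, b: abs(a - b)):
    # PF(x_ij) = sum over k of N(x_ij, x_kj) ** alpha.
    return sum(N(x_ij, x_kj) ** alpha for x_kj in column_values)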
rule-ID rule
p-9 ALB(3.357) ∧ HGB(12.786) ∧ D-BIL(0.057) ∧ T-CHO(214.6) ∧ TP(5.671)
p-10 ALB(4.6) ∧ GPT(18.5)
p-11 HGB(12.24) ∧ I-BIL(0.783) ∧ GOT(143.167)
p-12 PLT(314) ∧ T-BIL(0.3) ∧ I-BIL(0.2) ∧ WBC(7.7)
p-13 CHE(361) ∧ PLT(253.5) ∧ WBC(8.5)
p-14 WBC(8.344)
p-15 ALB(3.514) ∧ CHE(133.714) ∧ T-BIL(1.229) ∧ I-BIL(0.814) ∧
D-BIL(0.414) ∧ GOT(160) ∧ GPT(146.857)
p-16 GOT(130.833) ∧ GPT(183.833)
p-17 T-CHO(127.5)
p-18 CHE(96.714) ∧ T-CHO(110.714) ∧ WBC(4.175)
p-19 PLT(243.429) ∧ WBC(7.957)
rule-ID rule
p-1’ CHE(N)
p-2’ HGB(N)
p-3’ T-BIL(H) ∧ I-BIL(H) ∧ D-BIL(VH)
p-4’ CHE(N) ∧ PLT(VL)
p-5’ PLT(N) ∧ T-CHO(L)
p-6’ ALB(N) ∧ I-BIL(H)
p-7’ TP(H) ∧ GOT(VH) ∧ GPT(UH)
p-8’ ALB(L) ∧ T-CHO(L) ∧ TP(N) ∧ WBC(VL)
rule-ID rule
p-9’ ALB(L) ∧ HGB(N) ∧ D-BIL(N) ∧ T-CHO(N) ∧ TP(L)
p-10’ ALB(N) ∧ GPT(N)
p-11’ HGB(N) ∧ I-BIL(N) ∧ GOT(VH)
p-12’ PLT(N) ∧ T-BIL(N) ∧ I-BIL(N) ∧ WBC(N)
p-13’ CHE(N) ∧ PLT(N) ∧ WBC(N)
p-14’ WBC(N)
p-15’ ALB(L) ∧ CHE(L) ∧ T-BIL(H) ∧ I-BIL(N) ∧
D-BIL(H) ∧ GOT(VH) ∧ GPT(VH)
p-16’ GOT(VH) ∧ GPT(VH)
p-17’ T-CHO(N)
p-18’ CHE(VL) ∧ T-CHO(L) ∧ WBC(N)
p-19’ PLT(N) ∧ WBC(N)
Table 23. Some additions and changes of data granules for Table 12
give the mined results with discretization and granulation by using background
knowledge as shown in Tables 12 and 23.
We have also worked with Suzuki’s group to integrate the Peculiarity Ori-
ented Mining approach with the Exception Rules/Data Mining approach for
discovering more refined LC (Liver Cirrhosis) and non-LC classification mod-
els [9, 19].
Fourth, if the peculiar data found by POM have a negative meaning, they are removed from the dataset and GDT-RS is carried out again on the cleaned dataset.
Fifth, if the peculiar data found by POM have a positive meaning, peculiarity rules are generated by searching for associations among the peculiar data (or their granules).
6 Conclusions
We presented a multi-aspect mining approach in a multi-phase, multi-aspect
hepatitis data analysis process. Both pre-processing and post-processing steps
are important before/after using data mining agents. Informed knowledge dis-
covery in real-life hepatitis data needs to use background knowledge obtained
from medical doctors to guide the multi-phase spiral discovery process (pre-processing, rule mining, and post-processing) towards finding interesting and novel rules/features hidden in data.
Our methodology and experimental results show that the perspective of med-
ical doctors will be changed from a single type of experimental data analysis
towards a holistic view, by using our multi-aspect mining approach in which
various data mining agents are used in a distributed cooperative mode.
Acknowledgements
This work was supported by the grant-in-aid for scientific research on priority
area “Active Mining” from the Japanese Ministry of Education, Culture, Sports,
Science and Technology.
References
1. Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., Verkano, A., “Fast Discovery
of Association Rules”, in: Fayyad U.M., Piatetsky-Shapiro G., Smyth P., Uthu-
rusamy R. (eds.) Advances in Knowledge Discovery and Data Mining, The MIT
Press (1996) 307-328.
2. Agrawal R. et al. “Fast Discovery of Association Rules”, Advances in Knowledge
Discovery and Data Mining, AAAI Press (1996) 307-328.
3. Bazan, J. G. “A Comparison of Dynamic and Non-dynamic Rough Set Methods
for Extracting Laws from Decision System” in: Polkowski, L., Skowron, A. (Eds.),
Rough Sets in Knowledge Discovery 1: Methodology and Applications, Physica-
Verlag (1998) 321-365.
4. Dong, J.Z., Zhong, N., and Ohsuga, S. “Probabilistic Rough Induction: The GDT-
RS Methodology and Algorithms”, in: Z.W. Ras and A. Skowron (eds.), Founda-
tions of Intelligent Systems, LNAI 1609, Springer (1999) 621-629.
5. Fayyad, U.M., Piatetsky-Shapiro, G, and Smyth, P. “From Data Mining to Knowl-
edge Discovery: an Overview”, Advances in Knowledge Discovery and Data Mining,
MIT Press (1996) 1-36.
24. Zhong, N. “Knowledge Discovery and Data Mining”, the Encyclopedia of Micro-
computers, Volume 27 (Supplement 6) Marcel Dekker (2001) 93-122.
25. Zhong, N., Yao, Y.Y., Ohshima, M., and Ohsuga, S. “Interestingness, Peculiarity,
and Multi-Database Mining”, Proc. 2001 IEEE International Conference on Data
Mining (ICDM’01), IEEE Computer Society Press (2001) 566-573.
26. Zhong, N. and Ohsuga, S. “Automatic Knowledge Discovery in Larger Scale
Knowledge-Data Bases”, C. Leondes (ed.) The Handbook of Expert Systems, Vol.4:
Chapter 29, Academic Press (2001) 1015-1070.
27. Zhong, N., Liu, C., and Ohsuga, S. “Dynamically Organizing KDD Process”, In-
ternational Journal of Pattern Recognition and Artificial Intelligence, Vol. 15, No.
3, World Scientific (2001) 451-473.
28. Zhong, N., Dong, J.Z., and Ohsuga, S. “Rule Discovery by Soft Induction Tech-
niques”, Neurocomputing, An International Journal, Vol. 36 (1-4) Elsevier (2001)
171-204.
29. Zhong, N., Yao, Y.Y., Dong, J.Z., Ohsuga, S., “Gastric Cancer Data Mining with
Ordered Information”, J.J. Alpigini et al (eds.) Rough Sets and Current Trends in
Computing, LNAI 2475, Springer (2002) 467-478.
30. Zhong, N., Yao, Y.Y., Ohshima M., “Peculiarity Oriented Multidatabase Mining”,
IEEE Transactions on Knowledge and Data Engineering, Vol.15, No.4 (2003) 952-
960.
Sentence Role Identification in Medline Abstracts:
Training Classifier with Structured Abstracts
1 Introduction
With the rapid increase in the volume of scientific literature, there has been growing
interest in systems with which researchers can find relevant pieces of literature with less
effort. Online literature retrieval services, including PubMed [11] and CiteSeer [7], are
gaining popularity, as they provide users with access to large databases of abstracts or full papers.
PubMed helps researchers in medicine and biology by enabling search in the Medline
abstracts [9]. It also supports a number of auxiliary ways to filter search results. For
example, users can limit search with the content of the title and journal fields, or by
specifying the range of publication dates. All these additional facilities, however, rely
on information external to the content of abstract text. This paper exploits information inherent in abstracts, in order to make the retrieval process more goal-oriented.
The information we exploit is the structure underlying abstract text. The literature
search system we implemented allows search to be limited within portions of an abstract,
where ‘portions’ are selected from the typical structural roles within an abstract, such
as Background, Objective, (Experimental) Methods, (Experimental) Results, and Con-
clusions. We expect such a system to substantially reduce users’ effort to narrow down
search results, which may be huge with only one or two query terms. This expectation
is based on the postulate that some of these ‘portions’ are more relevant to the goal of
users compared with the rest. For example, a clinician, trying to find whether an effect
of a chemical substance on a disease is known or not, can ask the search engine for
passages in which the names of the substance and the disease both occur, but only in
the sentences describing results and conclusions. And if a user wants to browse through
what is addressed in each paper, listing only sentences describing study objective (and
possibly conclusions) should be convenient.
With conventional search systems, in which adding extra query terms is the only means of narrowing down, it is not easy (if not impossible) to achieve the same effect as the role-restricted search. Moreover, it is not always evident to users what extra query
terms are effective for narrowing down. Specifying the target roles of sentences may be
helpful in this case, providing an alternative way for limiting search.
An issue in building such a system is how to infer the roles of sentences in the abstract collection. Because the volume of the collection is often huge, manually labeling sentences is not a viable option; it is therefore desirable to automate this process. To au-
tomate this task, it is natural to formulate the task as that of supervised text classification,
in which sentences are classified into one of the predefined set of structural roles. The
main topics of this paper are (1) how to reduce reliance on human supervision in making
training data for the sentence classifiers, and (2) what classes, or sections, should be
presented to users so that they can effectively restrict search. This last decision must be
made on account of a trade-off between usability and accuracy of sentence classification.
We also examine (3) what types of features are effective for classification.
The rest of this paper is organized as follows. We first present some statistics on
structured abstracts in Medline and state the set of classes used in the subsequent dis-
cussions (Section 2). Section 3 describes the techniques used for building classifier and
computing features used for classification. These features are described in Section 4.
We then present experimental results showing the effect of various features used for
classification (Section 5), followed by a summary of the Medline search system we
implemented (Section 6). Finally we conclude in Section 7.
# of abstracts / %
Structured 374,585 / 6.0%
Unstructured 5,912,271 / 94.0%
Total 11,299,108 / 100.0%
This section presents and analyzes some statistics on the structured abstracts contained in Medline, which may affect the quality of training data and, in effect, the performance of resulting sentence classifiers. We also determine the set of classes presented to users on the basis of the analysis in this section.
1 It is possible that a sentence belongs to two or more sections, such as when it consists of a clause
describing research background and another clause describing research objective. However,
since this is relatively rare, we assume a sentence is the minimum unit of role assignment
throughout the paper.
presented to users so that they can specify the portion of abstract texts to which search
should be limited. It is natural to choose these classes from section headings occurring in
structured abstracts, because they reflect the conventional wisdom and it will allow us to
use those abstracts to train sentence classifiers used for labeling unstructured abstracts.
The problem is that there are more than 6,000 distinct headings in Medline 2002.
To maintain usability, the number of sections offered to users must be kept as small as
possible, but not so small as to impair the usefulness of role-limited search. But then, if
we restrict the number of classes, how should a section in a structured abstract be treated
when its heading does not match any of the restricted classes? In most cases, translating
a section into the classes of similar roles is straightforward, unless the selection of the
classes is unreasonable. For instance, identifying PURPOSE section with OBJECTIVE
section should generally be admissible. In practice, however, there are section headings
such as “BACKGROUND AND PURPOSES.” If BACKGROUND and PURPOSES are
two distinct classes presented to users, which we believe is a reasonable decision, we
need to determine which of these two classes each sentence in the section belongs to.
Therefore, at least some of the sentences in structured abstracts need to go through the
same labeling process as the unstructured abstracts.
Even when the section they belong to has a heading to which it seems straightforward to assign a class, there are cases in which we have to classify the sentences in a structured
abstract. The above mentioned OBJECTIVE (or PURPOSE) class is actually one such
class that needs sentence-wise classification. Below, we will analyze this case further.
As Table 3 shows, the most frequent sequences of headings are
1. BACKGROUND, METHOD(S), RESULTS, and CONCLUSION(S), followed by
2. OBJECTIVE, METHOD(S), RESULTS, and CONCLUSION(S).
Both of the above formats have only one of the two sections, BACKGROUND and OB-
JECTIVE, but not both. Inspecting abstracts conforming to these formats we found that
most of these two sections actually contain both the sentences describing research back-
ground, and those describing research objective; sentences describing research back-
ground occurred frequently in OBJECTIVE, and a large number of sentences describing
research objective were found in BACKGROUND as well.
This observation suggests that when an abstract contains only one BACKGROUND or OBJECTIVE section, we should not take these headings for granted as class labels. We should instead apply a sentence classifier to each sentence under these headings.
To verify this claim, we computed Sibson’s information radius (Jensen-Shannon
divergence) [8] for each section. Information radius DJS between two probability distri-
butions p(x) and q(x) is defined as follows, using Kullback-Leibler divergence DKL .
D_{JS}(p \,\|\, q) = \frac{1}{2}\left[ D_{KL}\!\left(p \,\Big\|\, \frac{p+q}{2}\right) + D_{KL}\!\left(q \,\Big\|\, \frac{p+q}{2}\right) \right]
 = \frac{1}{2}\left[ \sum_{x} p(x) \log \frac{p(x)}{(p(x)+q(x))/2} + \sum_{x} q(x) \log \frac{q(x)}{(p(x)+q(x))/2} \right].
Hence, information radius is a measure of dissimilarity between distributions. It
is symmetric in p and q, and is always well-defined; these properties do not hold
with DKL .
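A small sketch of the computation over discrete distributions given as aligned probability vectors (terms with p(x) = 0 are treated as zero, as usual):

import math

def kl_divergence(p, q):
    # D_KL(p || q) over aligned discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def information_radius(p, q):
    # D_JS(p || q) = ( D_KL(p || m) + D_KL(q || m) ) / 2, with m = (p + q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * (kl_divergence(p, m) + kl_divergence(q, m))

# information_radius([0.5, 0.5], [0.9, 0.1]) -> about 0.10 (natural logarithm)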
Table 4 shows that the sentences under the BACKGROUND and OBJECTIVE sec-
tions have similar distributions of word bigrams (a sequence of consecutive words of
length 2) and the combination of words and word bigrams. Also note the smaller di-
vergence between these classes (bold faced figures), compared with those for the other
class pairs. The implication is that these two headings are probably unreliable as separate
class labels.
3 Technical Background
This section reviews the techniques used in designing sentence role classifiers.
Given l labeled training examples (x_i, y_i), an SVM finds a hyperplane
f(x) = w · x + b = 0,   w ∈ ℝ^n, b ∈ ℝ,
in the n-dimensional space that separates positive examples from negative ones, i.e., for any i ∈ {1, . . . , l},
y_i f(x_i) = y_i (w · x_i + b) > 0.   (1)
Generally, if the given data are indeed linearly separable, there exist infinitely many separating hyperplanes that satisfy Eq. (1). Among these, SVM chooses the one that maximizes the
margin, which is the distance from the nearest examples to the hyperplane. This margin-
maximization feature of SVM makes it less prone to over-fitting in a higher dimensional
feature space, as it has been proven that such “optimal” separating hyperplanes minimize
the expected error on test data [15]. The nearest example vectors to the separating
hyperplane are called support vectors, reflecting the fact that the optimal hyperplane (or
vector w) is a linear combination of these vectors.
Finding the optimal hyperplane reduces to solving a quadratic optimization problem.
After this hyperplane (namely, w and b) is obtained, SVM predicts label y for a given
test example x with the following decision function.
y = sgn( f (x)) = sgn(w · x + b).
Even if the examples are not linearly separable, the problem is still solvable with a variation of SVMs, called soft-margin SVMs [4, 15], which allow a small number of exceptional examples (called bounded support vectors) to violate the separability condition (1). The acceptable range of exceptions is controlled through the soft-margin parameter C ∈ [0, ∞), where a smaller C tolerates a larger number of bounded support vectors.
It is also possible to first map the examples into higher-dimensional feature space
and find separating hyperplane in this space, with the help of the kernel trick [12].
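As an illustration only (this is not the authors' implementation), a soft-margin SVM with a quadratic kernel in a one-versus-the-rest configuration can be set up with scikit-learn roughly as follows; the toy sentences, labels, and the value of C are placeholders.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

sentences = ["we investigated the effect of ...", "these results suggest that ..."]
labels = ["OBJECTIVE", "CONCLUSION"]

model = make_pipeline(
    CountVectorizer(ngram_range=(1, 2), binary=True),          # word and word-bigram indicator features
    OneVsRestClassifier(SVC(kernel="poly", degree=2, C=1.0)),   # quadratic kernel, soft margin C
)
model.fit(sentences, labels)
print(model.predict(["this study aimed to determine ..."]))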
4 Feature Set
In Section 5, we will examine various types of features for sentence role identification and
evaluate their performance. The types of features examined are described in this section.
These features can be categorized into those which represent the information internal to
the sentence in question, and those representing textual context surrounding the sentence.
We refer to the former as ‘non-contextual’ features and the latter as ‘contextual’ features.
4.2 Tense
Another type of non-contextual feature we examine is based on the tense of sentences. A majority of authors seem to prefer changing tense depending on the structural role; some of them write the background of the study in the past perfect tense, and others write the objective in the present tense. To verify this claim, we examined the distribution of the tense of 135,489 sentences in 11,898 structured abstracts in Medline 2002, all of which have the section sequence BACKGROUND, OBJECTIVE, METHOD, RESULTS, and CONCLUSION.
We obtained the tense information of each sentence as follows. Given a sentence as input, we applied the Charniak parser [2] to it to obtain its phrase structure tree. We then recursively applied the Collins head rules [3] to this phrase structure tree, which yields the dependency structure tree of the sentence. From the dependency tree, we identify the head verb of the sentence, which is the verb located nearest to the root of the tree. The process so far is illustrated in Figures 1–3. Finally, the information on the tense of a head verb can be obtained from the part-of-speech tag the Charniak parser associates with it. Table 5 lists the Penn Treebank tags for verbs output by the Charniak parser. We incorporate each of these as a feature as well. A problem here is that the Penn Treebank tag associated with the head verb alone does not allow us to distinguish the verbs ‘have’, ‘has’ and ‘had’ used as a normal verb (as in ‘I had a good time.’) from those used as an auxiliary verb introducing the present or past perfect tense (as in ‘I have been to Europe.’). Hence, we not
[Figure: a phrase structure subtree with nodes VP, NP, PP, VBP, DT, NN, IN, NN over the words “labelling the enzyme with 7-chloro-4-nitrobenzofurazan”]
Fig. 1. Phrase structure subtree. Leaf nodes correspond to surface words, and each non-leaf node is labeled with a syntax category
[Figure: the same subtree with the head words (labelling), (enzyme), and (7-chloro-4-nitrobenzofurazan) propagated to the non-leaf nodes]
Fig. 2. Phrase structure subtree labeled with head words. Bold arrows depict the inheritance of head words by the head rules, and inherited head words are shown in parentheses
[Figure: the resulting dependency tree rooted at (labelling)]
Fig. 3. Dependency tree. This tree is obtained from Figure 2 by first coalescing the leaf nodes with their parents, and then recursively coalescing the nodes connected with the bold arrows
Table 5. The tags used in our experiment: Penn Treebank tags for verbs
Table 6. Distribution of the tense of head verbs included in each class (%)
Table 7. Distribution of the class given the tense of head verbs (%)
only use the tags of Table 5 as features, but also introduce features indicating ‘present perfect tense’ and ‘past perfect tense.’ The coordinate in a feature vector for present perfect tense is 1 if the sentence has ‘have’ or ‘has’ as its head verb and also has another past participle verb depending on the head verb in its dependency structure tree. The feature for ‘past perfect tense’ is defined likewise when a sentence has the head verb ‘had.’
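A sketch of turning the head verb's tag into tense features, assuming the head verb, its Penn Treebank tag, and the tags of its dependents have already been extracted from the parse (the parser-dependent extraction itself is not shown):

def tense_features(head_verb, head_tag, dependent_tags):
    # Map the head verb's tag to a feature, adding present/past perfect features when
    # have/has/had governs a past participle (VBN) in the dependency tree.
    features = {"head_tag=" + head_tag: 1}
    governs_vbn = "VBN" in dependent_tags
    if head_verb.lower() in ("have", "has") and governs_vbn:
        features["present_perfect"] = 1
    elif head_verb.lower() == "had" and governs_vbn:
        features["past_perfect"] = 1
    return features

# tense_features("have", "VBP", ["VBN"]) -> {'head_tag=VBP': 1, 'present_perfect': 1}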
Tables 6 and 7 respectively show the distribution of the tense of head verbs in each
class, and the distribution of classes given a tense. As we see from these tables, some
tenses exhibit a strong correlation with the class of the sentence. For example, if we know that a sentence is written in the present perfect tense, there is a more than 85% chance that the sentence is speaking about the background of the study. In addition, the distribution of the tense of the head verb is significantly different from one class to another.
In Section 5.3, we examine the effectiveness of the tense of the head verb of each
sentence as the feature for classification.
– It is unlikely that experimental results (RESULT) are presented before the descrip-
tion of experimental design (METHOD). Thus, whether the preceding sentences
have been labeled as METHOD conditions the probability of the present sentence
being classified as RESULT.
– The sentences of the same class have a good chance of occurring consecutively;
we would not expect authors to interleave sentences describing experimental results
(RESULT) with those in CONCLUSION and OBJECTIVE classes.
5 Experiments
This section reports the results of the experiments we conducted to examine the perfor-
mance of sentence classifiers.
the section sequence consisting of these sections is only second to the sequence BACKGROUND / METHOD(S) / RESULTS / CONCLUSION. However, identifying the sentences with the headings PURPOSE and AIM with those with OBJECTIVE makes the corresponding sectioning scheme the most frequent. Hence, we collected structured abstracts whose heading sequences match one of the following patterns:
After removing all symbols and replacing every contiguous sequence of numbers with a single symbol ‘#’, we split each of these abstracts into sentences using the UIUC sentence splitter [14]. We then filtered out the abstracts that produced a sentence with less than three words, regarding it as a possible error in sentence splitting. This yielded a total of 82,936 abstracts.
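A rough sketch of this preprocessing under our own assumptions (the real pipeline uses the UIUC sentence splitter; a naive period-based splitter stands in for it here):

import re

def preprocess_abstract(text, min_words=3):
    # Collapse digit runs to '#', strip remaining symbols, split into sentences,
    # and reject abstracts containing a suspiciously short sentence.
    text = re.sub(r"\d+", "#", text)
    text = re.sub(r"[^\w#\s.]", " ", text)
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if any(len(s.split()) < min_words for s in sentences):
        return None   # treated as a possible sentence-splitting error
    return sentences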
To reduce the number of features, we only took into account word bigrams occurring in at least 0.05% of the sentences, which amounts to 9,078 distinct bigrams. On the other hand, we used all words as features to avoid null feature vectors. The number of unigram word features was 104,733, which makes a total of 113,811 features, excluding the context features.
We obtained 103,962 training examples (sentences) from 10,000 abstracts randomly
sampled from the set of the 82,936 structured abstracts, and 10,356 test examples (sen-
tences) from 1,000 abstracts randomly sampled from the rest of the set.
For each of the nine context feature representations listed in Section 4.3, a set of four
soft-margin SVMs, one for each class, was built using the one-versus-the-rest configu-
ration, yielding a total of 4 · 9 = 36 SVMs. The quadratic kernel was used with SVMs,
and the soft margin (or capacity) parameter C was tuned independently for each SVM
to achieve the best performance. The performance of the different context features is shown in Table 8.
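For concreteness, here is a minimal sketch of this one-versus-the-rest setup written with scikit-learn, which is not the software used in the study; the feature matrix, the class labels, and the C grid are placeholders.

import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import GridSearchCV

X = np.random.rand(200, 50)      # placeholder feature vectors (words, bigrams, context)
y = np.random.choice(["OBJECTIVE", "METHOD", "RESULT", "CONCLUSION"], size=200)

# One soft-margin SVM per class with a quadratic kernel; the capacity parameter C
# is tuned independently for each binary classifier, as described above.
base = SVC(kernel="poly", degree=2, coef0=1.0)
clf = OneVsRestClassifier(GridSearchCV(base, {"C": [0.1, 1, 10]}, cv=3))
clf.fit(X, y)
print(clf.predict(X[:5]))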
In training the SVMs with Features (1)–(4), the classes of the surrounding sentences are the correct labels given in the training data. In the test phase, because the surrounding sentences are not labeled, we first applied the SVMs to the surrounding sentences and used the induced labels as their classes; thus, these context features are not necessarily correct.

Table 8. Performance of the different context features

                                                                      Accuracy (%)
Feature types                                                       sentence  abstract
(0) Non-contextual features alone (no context features)                83.6     25.0
(1) (0) + Class label of the previous sentence                         88.9     48.9
(2) (0) + Class labels of the previous two sentences                   89.9     50.6
(3) (0) + Class label of the next sentence                             88.9     50.9
(4) (0) + Class labels of the next two sentences                       89.3     51.2
(5) (0) + Relative location of the current sentence                    91.9     50.7
(6) (0) + Non-contextual features of the previous sentence             87.3     37.5
(7) (0) + Non-contextual features of the next sentence                 88.1     39.0
(8) (0) + Non-contextual features of the previous and next sentences   89.7     46.4
There was not much difference in the performance of the contextual features when accuracy was measured on a per-sentence basis. All contextual features (1)–(8) obtained about 90% accuracy, an improvement of 4 to 8% over the baseline (0) in which
no context features were used. By contrast, the accuracy on a per-abstract basis, in which
a classification of an abstract is judged to be correct only if all the constituent sentences
are correctly classified, varied between 37% and 51%. The maximum accuracy of 51%,
a 26% improvement over the baseline (0), was obtained for features (3), (4), and (5).
Thus, the abstracts having only one of BACKGROUND and OBJECTIVE are not in-
cluded in this experiment, in order to avoid the mixture of sentences of different roles
in these classes; such mixture was observed only when exactly one of these sections
occurred in an abstract.
                                                                      Accuracy (%)
Feature types                                                       sentence  abstract
(0) Non-contextual features alone (no context features)                83.5     19.5
(1) (0) + Class label of the previous sentence                         90.0     50.2
(2) (0) + Class labels of the previous two sentences                   91.1     53.2
(3) (0) + Class label of the next sentence                             89.3     53.1
(4) (0) + Class labels of the next two sentences                       90.1     54.4
(5) (0) + Relative location of the current sentence                    92.4     47.9
(6) (0) + Non-contextual features of the previous sentence             86.9     32.0
(7) (0) + Non-contextual features of the next sentence                 87.2     31.0
(8) (0) + Non-contextual features of the previous and next sentences   89.9     43.2
The abstracts in this collection went through the same preprocessing as in Section 5.1,
and the number of sentences that survived preprocessing was 135,489. We randomly
sampled 90% of the collected abstracts as training data, and retained the rest for testing.
For each of the 10 (= C(5, 2)) pairs of section labels, an SVM was trained only with
the sentences belonging to that specific pair of sections, using one section as positive
examples and the other as negative examples. Again, we used quadratic kernels and the
bag-of-words and word-bigrams features. No context features were used this time.
The result is shown in Table 9. Note that an F-measure of 96.1 was observed in dis-
criminating BACKGROUND from OBJECTIVE, which is a specific example discussed
in Section 2.2. This result implies the feasibility of the pairwise classification approach
in this task.
For reference, we present in Table 10 the performance of the multi-class classifi-
cation with pairwise combination using various context features. This experiment was
conducted under the same setting as Section 5.1. The difference in performance appears
to be marginal between the one-versus-the-rest (cf. Table 8) and pairwise combination
methods.
Table 12. Performance for individual class pairs using the word, word-bigram, and tense features
RESULTS. We note that these pairs coincide with those having quite distinct distributions of tense (cf. Table 6).
Table 13. Frequency of sentence role classes in unstructured and structured abstracts
Table 15. Classification accuracy of unstructured abstracts using SVMs trained with structured
abstracts
                                                                      Accuracy (%)
Feature types                                                       Sentence  Abstract
(0) Words and word bigrams                                              75.4     15.0
(1) (0) + Class of the previous sentence                                68.1     16.3
(2) (0) + Classes of the previous two sentences                         64.2     10.7
(3) (0) + Class of the next sentence                                    71.7     15.7
(4) (0) + Classes of the next two sentences                             72.2     15.6
(5) (0) + Non-contextual features of the previous sentence              76.2     17.3
(6) (0) + Non-contextual features of the next sentence                  76.2     16.9
(7) (0) + Non-contextual features of the previous and next sentences    78.3     20.1
(8) (0) + Relative location                                             78.6     19.4
One possible reason may be that we trained the classifier using only structured abstracts with a fixed heading sequence.
6 A Prototype Implementation
Using the feature set described in Section 3.1 together with the context feature (5) of
Section 4.3, we constructed five SVM classifiers, one for each of the five sections:
BACKGROUND, OBJECTIVE, METHOD, RESULT, and CONCLUSION. With these
SVMs, we labeled the sentences in the unstructured abstracts in Medline 2003 whose
publication year is either 2001 or 2002. The same labeling process was also applied to
the sentences in structured abstracts when correspondence was not evident between their
section headings and one of the above five sections. For instance, when the heading was
‘MATERIALS AND METHODS,’ we did not apply the labeling process, but regarded it
as ‘METHOD’; when it was ‘METHODS AND RESULTS,’ the classifiers were applied
in order to tell which of the two classes, METHOD and RESULT, each constituent
sentence belongs to.
As an exception, when a structured abstract contained either one of BACKGROUND
or OBJECTIVE (or equivalent) sections but not both, we also classified the sentences
in these sections into one of the BACKGROUND and the OBJECTIVE classes, using
the pairwise classifier described in Section 5.2 for discriminating these two classes. This
exception reflects the observation made in Section 2.2.
An experimental literature retrieval system was implemented using PHP on top of the
Apache web server. The full-text retrieval engine Namazu [10] was used as a back-end
search engine2 . A screenshot is shown in Figure 4. The form on the page contains a field
for entering query terms, a ‘Go’ button and radio buttons marked ‘Any’ and ‘Select from’
for choosing whether the keyword search should be performed on the whole abstract
texts, or on selected sections only. If the user chooses the 'Select from' button rather than 'Any,' the check boxes on its right are activated. These boxes correspond to the five target
sections, BACKGROUND, OBJECTIVE, METHOD, RESULT, and CONCLUSION.
In the search result section in the lower half of the screen, the query terms found
in a specified section are highlighted in bold face letters, and the sections are shown in
different background colors.
2 We are currently re-implementing the system so that it uses a plain relational database as the
back-end, instead of Namazu.
Acknowledgment
This research was supported in part by MEXT under Grant-in-Aid for Scientific Research
on Priority Areas (B) no. 759. The first author was also supported in part by MEXT under
Grant-in-Aid for Young Scientists (B) no. 15700098.
References
[1] Ad Hoc Working Group for Critical Appraisal of Medical Literature. A proposal for more
informative abstracts of clinical articles. Annals of Internal Medicine, 106(4):598–604,
1987.
[2] Eugene Charniak. A maximum-entropy-inspired parser. In Proceedings of the Sec-
ond Meeting of North American Chapter of Association for Computational Linguistics
(NAACL-2000), pages 132–139, 2000.
[3] Michael Collins. Head-Driven Statistical Models for Natural Language Processing. PhD
dissertation, University of Pennsylvania, 1999.
[4] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20:273–
297, 1995.
[5] M. A. K. Halliday and Ruqaiya Hasan. Cohesion in English. Longman, London, 1976.
[6] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: prob-
abilistic models for segmenting and labeling sequence data. In Proceedings of the 18th
International Conference on Machine Learning (ICML-2001), pages 282–289. Morgan
Kaufmann, 2001.
[7] Steve Lawrence, C. Lee Giles, and Kurt Bollacker. Digital libraries and autonomous citation
indexing. IEEE Computer, 32(6):67–71, 1999.
[8] Lillian Lee and Fernando Pereira. Measures of distributional similarity. In Proceedings of
the 37th Annual Meeting of the Association for Computational Linguistics (ACL-99), pages
25–32, 1999.
[9] MEDLINE. http://www.nlm.nih.gov/databases/databases_medline.html, 2002–2003. U.S.
National Library of Medicine.
[10] Namazu. http://www.namazu.org/, 2000.
[11] PubMed. http://www.ncbi.nlm.nih.gov/PubMed/, 2003. U.S. National Library of Medicine.
[12] Bernhard Schölkopf and Alex J. Smola. Learning with Kernels. MIT Press, 2002.
[13] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Pro-
ceedings of the Human Language Technology Conference North American Chapter of As-
sociation for Computational Linguistics (HLT-NAACL 2003), pages 213–220, Edmonton,
Alberta, Canada, 2003. Association for Computational Linguistics.
[14] UIUC sentence splitter software. http://l2r.cs.uiuc.edu/~cogcomp/cc-software.htm, 2001.
University of Illinois at Urbana-Champaign.
[15] Vladimir Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
CHASE2 – Rule Based Chase Algorithm
for Information Systems of Type λ
1 Introduction
Common problems encountered by Query Answering Systems (QAS), intro-
duced by Raś in [15], [18], either for Information Systems or for Distributed
Autonomous Information Systems (DAIS) include the handling of incomplete
attributes when answering a query. One plausible solution to answer a query
involves the generation of rules describing all incomplete attributes used in a
query and then chasing the unknown values in the local database with respect
to the generated rules. These rules can be given by domain experts and also
can be discovered either locally or at remote sites of a DAIS. Since not all unknown values will necessarily be found, the process is repeated on the enhanced database until all unknowns are found or no new information is generated. When
the fixed point is reached, QAS will run the original query against the enhanced
database. The results of the query run on three versions of the information sys-
tem have been compared by Dardzińska and Raś in [5]: DAIS with a complete
local database, DAIS with incomplete local database (where incomplete infor-
mation can be represented only in terms of null values), and DAIS with a local
incomplete database enhanced by a rule-based chase algorithm. The chase algorithm presented in [5] was based only on a consistent set of rules. The notion of a tableaux system and the chase algorithm based on a set F of functional dependencies are presented, for instance, in [1]. A chase algorithm based on functional dependencies always terminates when applied to a finite tableaux system. It was shown that, if
one execution of the algorithm generates a tableaux system that satisfies F , then
every execution of the algorithm generates the same tableaux system.
There are many methods to replace missing values with predicted values or
estimates [23], [19], [11], [7]. Some of them are given below:
– Most Common Attribute Value. It is one of the simplest methods to deal
with missing attribute values. The value of the attribute that occurs most
often is selected to be the value for all the unknown values of the attribute.
– Concept Most Common Attribute Value. This method is a restriction of the first method to the concept, i.e., to all examples with the same value of the decision. The value of the attribute that occurs most commonly within the concept is selected as the value for all the unknown values of the attribute within that concept. This method is also called the maximum relative frequency method, or the maximum conditional probability method. (A small sketch of these first two methods is given after this list.)
– C4.5. This method is based on entropy and splitting the example with miss-
ing attribute values to all concepts [14].
– Method of Assigning all Possible Values of the Attribute. In this
method, an example with a missing attribute value is replaced by a set of
new examples, in which the missing attribute value is replaced by all possible
values of the attribute.
– Method of Assigning all Possible Values of the Attribute Restricted
to the Given Concept. This method is a restriction of the previous method
to the concept, indicated by an example with a missing attribute value.
– Event-Covering Method. This method is also a probabilistic approach
to fill in the unknown attribute values by event-covering. Event covering
is defined as a strategy of selecting a subset of statistically independent
events in the outcome space of variable-pairs, disregarding whether or not
the variables are statistically independent.
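The following small sketch (our illustration, not taken from any of the cited systems) implements the first two methods above: global most-common-value imputation and its concept-restricted variant.

from collections import Counter

def impute_most_common(rows, attr, decision=None, concept_restricted=False):
    """Replace None values of `attr`; rows are dicts of attribute values."""
    for row in rows:
        if row[attr] is not None:
            continue
        if concept_restricted:
            pool = [r[attr] for r in rows
                    if r[attr] is not None and r[decision] == row[decision]]
        else:
            pool = [r[attr] for r in rows if r[attr] is not None]
        if pool:
            row[attr] = Counter(pool).most_common(1)[0][0]
    return rows

data = [{"a": "x", "d": "yes"}, {"a": None, "d": "yes"}, {"a": "y", "d": "no"}]
print(impute_most_common(data, "a", decision="d", concept_restricted=True))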
To impute missing values of categorical attributes, two types of approaches can be used:
– Rule Based Techniques (e.g., association rule, rule induction techniques,
etc.)
– Statistical Modelling (e.g., multinomial, log-linear modelling, etc.)
Two main rule based models have been considered: rule induction techniques
and association rules. For categorical attributes with low cardinality domains
(few values), rule induction techniques such as decision tree [14] and decision
systems [10] can be used to derive the missing values. However, for categori-
cal attributes with large cardinality domains (many values), the rule induction
techniques may suffer due to too many predicted classes. In this case, the com-
bination of association relationships among categorical attributes and statistical
features of possible attribute values can be used to predict the best possible
values of missing data. The discovered association relationships among different
attributes can be thought of as constraint information on their possible values and
can then be used to predict the true values of missing attributes.
The algorithm presented in this paper for predicting which attribute value should replace an incomplete value of a categorical attribute in a given dataset has a clear advantage over many other methods for predicting incomplete values, mainly because it uses the existing associations between values of attributes, in a chase strategy, repeatedly for each newly generated dataset until a fixpoint is reached. To find these associations we can use any association rule mining algorithm or any rule discovery algorithm such as LERS (see [8]) or Rosetta (see [20]). Unfortunately, these algorithms, including the Chase algorithm we presented in [5], do not handle partially incomplete data, where a(x) is equal, for instance, to {(a1 , 1/4), (a2 , 1/4), (a3 , 1/2)}.
We assume here that a is an attribute, x is an object, and {a1 , a2 , a3 } ⊆ Va .
By Va we mean the set of values of attribute a. The weights assigned to these
three attribute values should be read as:
• the confidence that a(x) = a1 is 1/4,
• the confidence that a(x) = a2 is 1/4,
• the confidence that a(x) = a3 is 1/2.
In this paper we present a new chase algorithm (called Chase2) which can be used for chasing incomplete information systems with rules which do not have to be consistent (an assumption that was required in Chase1, presented in [5]). We propose how to compute the confidence of inconsistent rules, and next we show how these rules are used by Chase2. The assumption placed on the incompleteness of data in this paper thus allows a set of weighted attribute values to serve as the value of an attribute. Additionally, we assume that the sum of these weights has to be equal to 1. The definition of an
information system of type λ given in this paper is a modification of definitions
given by Dardzińska and Raś in [5],[4] and used later by Raś and Dardzińska in
[17] to talk about semantic inconsistencies among sites of DIS from the query
answering point of view. Type λ is introduced mainly to monitor the weights assigned to values of attributes by the Chase2 algorithm (the algorithm checks whether they are greater than or equal to λ). If the weight assigned by Chase2 to one of the attribute values describing object x is below this threshold, then that attribute value is no longer considered as a value which describes x.
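As an illustration of this type-λ constraint, here is a minimal sketch (our own helper, not part of Chase2 as published): a value is a set of weighted attribute values summing to 1, and weights below λ are dropped with the remainder renormalized.

def apply_lambda(value_set, lam):
    """value_set: e.g. {'a1': 0.25, 'a2': 0.25, 'a3': 0.5}; weights sum to 1."""
    kept = {v: w for v, w in value_set.items() if w >= lam}
    total = sum(kept.values())
    return {v: w / total for v, w in kept.items()} if total > 0 else {}

print(apply_lambda({"a1": 0.25, "a2": 0.25, "a3": 0.5}, lam=0.3))   # {'a3': 1.0}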
where, for any NS(t1) = {(xi, pi)}i∈I and NS(t2) = {(xj, qj)}j∈J, we have:
– NS(t1) ⊕ NS(t2) = {(xi, pi)}i∈(I−J) ∪ {(xj, qj)}j∈(J−I) ∪ {(xi, max(pi, qi))}i∈(I∩J),
– NS(t1) ⊗ NS(t2) = {(xi, pi · qi)}i∈(I∩J).
begin
  if card(aj(x)) ≠ 1 and {(ti → v) : i ∈ I} is a maximal subset of rules from L(D)
  such that (x, pi) ∈ NSj(ti) then
    if Σ_{i ∈ I} [pi · conf(ti → v) · sup(ti → v)] ≥ λ then
      begin
        bj(x) := bj(x) ∪ {(v, Σ_{i ∈ I} [pi · conf(ti → v) · sup(ti → v)])};
        nj := nj + Σ_{i ∈ I} [pi · conf(ti → v) · sup(ti → v)]
      end
    end
  qj := qj + nj;
end
if Ψ(aj(x)) ≠ [bj(x)/qj] then aj(x) := [bj(x)/qj];
j := j + 1;
end
S' := ⊕{Sj : 1 ≤ j ≤ k}; /the definition of ⊕{Sj : 1 ≤ j ≤ k} is given below/
if S' ≠ S then Chase2(S', In(A), L(D)) else Chase(S) := S'
end
Information system S' = ⊕{Sj : 1 ≤ j ≤ k} is defined as:
aS'(x) = aSj(x), if a = aj for some j ∈ {1, 2, ..., k},
for any attribute a and object x.
Still, one more definition is needed to complete the presentation of the algorithm. Namely, we say that:
[bj(x)/p] = {(vi, pi/p)}i∈I, if bj(x) = {(vi, pi)}i∈I.
Algorithm Chase2 converts any incomplete or partially incomplete informa-
tion system S to a new information system which is more complete. At each
recursive call of Chase2 , its input data including S, L(D), and from time to
time In(A) are changing. So, before any recursive call is executed, these new
data have to be computed first.
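A rough, simplified sketch of the per-attribute update that Chase2 performs for a single object is given below; the data layout and the point at which λ is applied follow the worked example in the next section, and are our reading rather than the authors' code.

def chase_update(applicable_rules, lam):
    """applicable_rules: (predicted_value, p, conf, sup) for every rule t -> v
    supported by the object; each contributes p * conf * sup to value v."""
    weights = {}
    for value, p, conf, sup in applicable_rules:
        weights[value] = weights.get(value, 0.0) + p * conf * sup
    total = sum(weights.values())
    normalized = {v: w / total for v, w in weights.items()}
    kept = {v: w for v, w in normalized.items() if w >= lam}        # lambda check
    kept_total = sum(kept.values())
    return {v: w / kept_total for v, w in kept.items()} if kept_total else {}

rules = [("v1", 0.5, 0.8, 2.0), ("v2", 1.0, 0.6, 1.0), ("v2", 0.5, 0.9, 1.0)]
print(chase_update(rules, lam=0.3))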
Now, we give the time complexity (T-Comp) of the algorithm. Assume first that S = S(0) = (X, A, V), card(In(A)) = k, and n = card(X). We also assume that S(i) = Chase^i(S) and
n(i) = card{x ∈ X : (∃a ∈ A)[card(aS(i)(x)) ≠ 1]}, both for i ≥ 0.
Clearly, n(0) > n(1) > n(2) > ... > n(p) = n(p + 1), because information system Chase^(i+1)(S) is more complete than information system Chase^i(S), for any i ≥ 0.
T-Comp(Chase2) = Σ_{i=0}^{p} [k · [n + n(i) · card(L(D)) · n] + n(i)] = Σ_{i=0}^{p} [k · [n(i) · card(L(D)) · n]] = [k² · n³ · m].
The final worst-case complexity is based on the observation that p cannot be larger than k · n. We also assume here that m = card(L(D)).
To explain the algorithm, we apply Chase2 to information system S3 pre-
sented by Table 3. We assume that L(D) contains the following rules (listed
with their support and confidence) defining attribute e and extracted from S3
by ERID:
r1 = [a1 → e3] (sup(r1) = 1, conf(r1) = 0.5)
r2 = [a2 → e2] (sup(r2) = 5/3, conf(r2) = 0.51)
r3 = [a3 → e1] (sup(r3) = 17/12, conf(r3) = 0.51)
r4 = [b1 → e1] (sup(r4) = 2, conf(r4) = 0.72)
r5 = [b2 → e3] (sup(r5) = 8/3, conf(r5) = 0.51)
r6 = [c2 → e1] (sup(r6) = 2, conf(r6) = 0.66)
r7 = [c3 → e3] (sup(r7) = 7/6, conf(r7) = 0.64)
r8 = [a3 · c1 → e3] (sup(r8) = 1, conf(r8) = 0.8)
r9 = [a3 · d1 → e3] (sup(r9) = 1, conf(r9) = 0.5)
r10 = [c1 · d1 → e3] (sup(r10) = 1, conf(r10) = 0.5)
It can be noticed that the values e(x1), e(x4), and e(x6) of attribute e are changed in S3 by the Chase2 algorithm. The next section shows how to compute these three values and how to convert them, if needed, to a new set of values satisfying the constraints required for system S4 to retain its λ status. A similar process is applied to all incomplete attributes in S3. After all changes corresponding to all incomplete attributes are recorded, system S3 is replaced by Ψ(S3) and the whole process is recursively repeated until a fixed point is reached.
Algorithm Chase2 will compute a new value for e(x1) = {(e1, 1/2), (e2, 1/2)}, denoted by enew(x1) = {(e1, ?), (e2, ?), (e3, ?)}. To do that, Chase2 identifies all rules in L(D) supported by x1. It can be easily checked that r1, r2, r4, r5, and r10 are the rules supported by x1. To calculate the support of x1 for r1, we take 1 · 1/2 · 1/3. In a similar way we calculate the support of x1 for the remaining rules.
As the result, we get the list of weighted values of attribute e supported by L(D)
for x1 , as follows:
(e3, 1/3 · 1 · 1/2 + 1/3 · 8/3 · 51/100 + 1 · 1 · 1/2) = (e3, 1.119)
(e2, 2/3 · 5/3 · 51/100) = (e2, 1.621)
(e1, 2/3 · 2 · 72/100) = (e1, 0.96).
we have:
enew(x4) = {(e1, 0.722/(0.5 + 0.722)), (e3, 0.5/(0.5 + 0.722))} = {(e1, 0.59), (e3, 0.41)}
And finally, for x6:
(e3, 8/3 · 1 · 51/100 + 1 · 7/6 · 64/100) = (e3, 2.11)
(e2, 1 · 5/3 · 51/100) = (e2, 0.85)
(e1, 0)
we have:
enew(x6) = {(e2, 0.85/(2.11 + 0.85)), (e3, 2.11/(2.11 + 0.85))} = {(e2, 0.29), (e3, 0.713)}
For λ = 0.3 the values of e(x1 ) and e(x6 ) will change to:
e(x1 ) = {(e2 , 0.59), (e3 , 0.41)}, e(x6 ) = {(e3 , 1)}.
Table 4 shows the resulting table.
Initial testing performed on several incomplete tables of size 50 × 2,000 with randomly generated data gave us quite promising results concerning the precision of Chase2. We started with a complete table S and randomly removed 10 percent of its values. This new table is denoted by S'. For each incomplete column in S', say d, we use ERID to extract rules defining d in terms of the other attributes in S'. These rules are stored in L(D). In the following step, we apply Chase2, making d maximally complete. Independently, the same procedure is applied to all other incomplete columns. As the result, we obtain a new table S''. Now, the whole procedure is repeated again on S''. The process continues until the fixed point is reached. Finally, we compare the new values stored in the initially empty slots of S' with the corresponding values in S. Based on this comparison, we easily compute the precision of Chase2.
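A hedged sketch of this evaluation loop follows; the names and the toy imputer are ours, and `impute` stands in for the ERID + Chase2 pipeline.

import random
from collections import Counter

def evaluate(table, impute, fraction=0.1, seed=0):
    """Hide a fraction of the values of a complete table, impute them,
    and return the precision measured on the hidden slots."""
    rng = random.Random(seed)
    cells = [(i, a) for i, row in enumerate(table) for a in row]
    hidden = rng.sample(cells, max(1, int(fraction * len(cells))))
    damaged = [dict(row) for row in table]
    for i, a in hidden:
        damaged[i][a] = None
    repaired = impute(damaged)
    correct = sum(repaired[i][a] == table[i][a] for i, a in hidden)
    return correct / len(hidden)

def naive_impute(rows):                       # trivial stand-in imputer
    for a in rows[0]:
        common = Counter(r[a] for r in rows if r[a] is not None).most_common(1)[0][0]
        for r in rows:
            if r[a] is None:
                r[a] = common
    return rows

table = [{"a": "x", "b": "u"}, {"a": "x", "b": "v"}, {"a": "y", "b": "u"},
         {"a": "x", "b": "u"}, {"a": "y", "b": "u"}]
print(evaluate(table, naive_impute))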
4 Conclusion
We expect much better results if a single information system is replaced by
distributed autonomous information systems investigated in [15], [17], [18]. This
is justified by experimental results showing higher confidence in rules extracted
through distributed data mining than in rules extracted through local mining.
References
1. Atzeni, P., DeAntonellis, V. (1992) Relational database theory, The Benjamin
Cummings Publishing Company
2. Benjamins, V. R., Fensel, D., Pérez, A. G. (1998) Knowledge management through
ontologies, in Proceedings of the 2nd International Conference on Practical Aspects
of Knowledge Management (PAKM-98), Basel, Switzerland.
3. Chandrasekaran, B., Josephson, J. R., Benjamins, V. R. (1998) The ontology of
tasks and methods, in Proceedings of the 11th Workshop on Knowledge Acquisition,
Modeling and Management, Banff, Alberta, Canada
4. Dardzińska, A., Raś, Z.W. (2003) On Rules Discovery from Incomplete Informa-
tion Systems, in Proceedings of ICDM’03 Workshop on Foundations and
New Directions of Data Mining, (Eds: T.Y. Lin, X. Hu, S. Ohsuga, C. Liau),
Melbourne, Florida, IEEE Computer Society, 2003, 31-35
5. Dardzińska, A., Raś, Z.W. (2003) Chasing Unknown Values in Incomplete Infor-
mation Systems, in Proceedings of ICDM’03 Workshop on Foundations
and New Directions of Data Mining, (Eds: T.Y. Lin, X. Hu, S. Ohsuga, C.
Liau), Melbourne, Florida, IEEE Computer Society, 2003, 24-30
6. Fensel, D., (1998), Ontologies: a silver bullet for knowledge management and elec-
tronic commerce, Springer-Verlag, 1998
7. Giudici, P. (2003) Applied Data Mining, Statistical Methods for Business and
Industry, Wiley, West Sussex, England
8. Grzymala-Busse, J. (1991) On the unknown attribute values in learning from ex-
amples, in Proceedings of ISMIS’91, LNCS/LNAI, Springer-Verlag, Vol. 542, 1991,
368-377
9. Grzymala-Busse, J. (1997) A new version of the rule induction system LERS, in
Fundamenta Informaticae, IOS Press, Vol. 31, No. 1, 1997, 27-39
10. Grzymala-Busse, J., Hu, M. (2000) A Comparison of several approaches to missing
attribute values in data mining, in Proceedings of the Second International Confer-
ence on Rough Sets and Current Trends in Computing, RSCTC’00, Banff, Canada,
340-347
11. Little, R., Rubin, D.B. (1987) Statistical analysis with missing data, New York,
John Wiley and Sons
12. Pawlak, Z. (1991) Rough sets-theoretical aspects of reasoning about data, Kluwer,
Dordrecht
1 Introduction
Clustering of time-series data [1] has been receiving considerable interest as a promising method for discovering interesting features shared commonly by a set of sequences. One of the most important issues in time-series clustering is the determination of the (dis-)similarity between sequences. Basically, the similarity of two sequences is calculated by accumulating the distances of the two data points located at the same time position, because such a distance-based similarity has preferable mathematical properties that extend the choice of subsequent grouping algorithms. However, this method requires that the lengths of the sequences be equal.
2 Methods
We implemented two clustering algorithms, agglomerative hierarchical clustering
(AHC) in [8] and rough sets-based clustering (RC) in [9]. For AHC we employed
two linkage criteria, average-linkage AHC (AL-AHC) and complete-linkage AHC (CL-AHC). We also implemented algorithms of symmetrical time warping de-
scribed briefly in [2] and one-dimensional multiscale matching described in [4].
In the following subsections we briefly explain their methodologies.
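For reference, here is a minimal sketch of symmetric dynamic time warping in its textbook form (not necessarily the exact variant of [2]); it accumulates point-wise distances while allowing one-to-many matching.

def dtw(a, b):
    """Return the DTW cost of aligning sequences a and b, normalized by length."""
    n, m = len(a), len(b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m] / (n + m)

print(dtw([0, 1, 2, 1, 0], [0, 1, 1, 2, 1, 0]))   # 0.0: the warp absorbs the extra point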
Single Linkage. One way to define the intergroup dissimilarity between two clusters is to take that of the closest pair:
dSL(G, H) = min {d(xi, xj) : xi ∈ G, xj ∈ H},
where G and H are clusters to be merged in the next step, xi and xj are ob-
jects in G and H respectively, d(xi , xj ) is the dissimilarity between xi and xj .
If dSL (G, H) is the smallest among possible pairs of clusters, G and H will be
merged in the next step. The clustering based on this distance is called single link-
age agglomerative hierarchical clustering (SL-AHC), also called nearest-neighbor
technique.
Complete Linkage. Alternatively, the intergroup dissimilarity can be taken to be that of the furthest pair:
dCL(G, H) = max {d(xi, xj) : xi ∈ G, xj ∈ H},
where G and H are clusters to be merged in the next step. The clustering based
on this distance is called complete linkage agglomerative hierarchical clustering
(CL-AHC), also called furthest-neighbor technique.
dAL(G, H) = (1/(nG · nH)) Σ_{i=1}^{nG} Σ_{j=1}^{nH} d(xi, xj),
where G and H are clusters to be merged in the next step, nG and nH respectively
represent the numbers of objects in G and H. This is called the average linkage
agglomerative hierarchical clustering (AL-AHC), also called the unweighted pair-
group method using the average approach (UPGMA).
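The three linkage criteria can be summarized in a few lines; the sketch below is ours, with average linkage normalized by nG · nH as in the usual UPGMA convention.

def single_linkage(G, H, d):
    return min(d(x, y) for x in G for y in H)

def complete_linkage(G, H, d):
    return max(d(x, y) for x in G for y in H)

def average_linkage(G, H, d):
    return sum(d(x, y) for x in G for y in H) / (len(G) * len(H))

d = lambda x, y: abs(x - y)
G, H = [1.0, 2.0], [4.0, 7.0]
print(single_linkage(G, H, d), complete_linkage(G, H, d), average_linkage(G, H, d))
# 2.0 6.0 4.0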
In the first step of rough clustering, an initial equivalence relation is assigned to each object xi:
Ri = {{Pi}, {U − Pi}},    (1)
where
Pi = {xj | d(xi, xj) ≤ Thdi}, ∀xj ∈ U.    (2)
d(xi, xj) denotes the dissimilarity between objects xi and xj, and Thdi denotes an upper threshold value of dissimilarity for object xi. The equivalence relation Ri classifies U into two categories: Pi, which contains objects similar to xi, and U − Pi, which contains objects dissimilar to xi. When d(xi, xj) is smaller than Thdi, object xj is considered to be indiscernible from xi. The threshold value Thdi
can be determined automatically according to the denseness of objects on the
dissimilarity plane [9]. Indiscernible objects under all equivalence relations form
a cluster. In other words, a cluster corresponds to a category Xi of U/IN D(R).
In the second step, we refine the initial equivalence relations according to
their global relationships. First, we define an indiscernibility degree, γ(xi , xj ),
for two objects xi and xj as follows.
γ(xi, xj) = [ Σ_{k=1}^{|U|} δk^indis(xi, xj) ] / [ Σ_{k=1}^{|U|} δk^indis(xi, xj) + Σ_{k=1}^{|U|} δk^dis(xi, xj) ],    (3)
where
δk^indis(xi, xj) = 1, if (xi ∈ [xk]Rk ∧ xj ∈ [xk]Rk); 0, otherwise,    (4)
and
δk^dis(xi, xj) = 1, if (xi ∈ [xk]Rk ∧ xj ∉ [xk]Rk) or (xi ∉ [xk]Rk ∧ xj ∈ [xk]Rk); 0, otherwise.    (5)
Equation (4) shows that δkindis (xi , xj ) takes 1 only when the equivalence rela-
tion Rk regards both xi and xj as indiscernible objects, under the condition
that both of them are in the same equivalence class as xk . Equation (5) shows
that δkdis (xi , xj ) takes 1 only when Rk regards xi and xj as discernible objects,
under the condition that either of them is in the same class as xk . By summing
δkindis (xi , xj ) and δkdis (xi , xj ) for all k(1 ≤ k ≤ |U |) as in Equation (3), we ob-
tain the percentage of equivalence relations that regard xi and xj as indiscernible
objects.
Objects with a high indiscernibility degree can be interpreted as similar objects. Therefore, they should be classified into the same cluster. Thus we modify an equivalence relation if it has the ability to discern objects with high γ, as follows:
Ri' = {{Pi'}, {U − Pi'}},
Pi' = {xj | γ(xi, xj) ≥ Th}, ∀xj ∈ U.
This prevents the generation of small clusters formed due to overly fine classification knowledge. Th is a threshold value that determines the indiscernibility of objects. Therefore, we associate Th with the roughness of knowledge and perform iterative refinement of the equivalence relations by constantly decreasing Th. Consequently, a coarsely classified set of sequences is obtained as U/IND(R'). Note that each refinement step is performed using the previously refined set of equivalence relations, as the indiscernibility degrees may change after refinement.
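A small sketch of the indiscernibility degree of Equation (3) is given below; the initial equivalence relations are represented simply as the sets Pk of objects judged similar to xk, which is a simplification of the description above rather than the authors' implementation.

def indiscernibility_degree(i, j, P):
    """P[k] is the set of objects indiscernible from object k (including k)."""
    indis = dis = 0
    for k in range(len(P)):
        in_i, in_j = i in P[k], j in P[k]
        if in_i and in_j:
            indis += 1        # R_k groups x_i and x_j together with x_k
        elif in_i != in_j:
            dis += 1          # R_k separates x_i and x_j relative to x_k
    return indis / (indis + dis) if (indis + dis) else 0.0

P = [{0, 1}, {0, 1, 2}, {1, 2}, {3}]     # toy initial equivalence classes
print(indiscernibility_degree(0, 1, P))  # 2/3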
The range of t (= Φ) was set to 0 ≤ t < 6π. The sampling interval was Φ/500; thus each sequence consisted of 500 points. The scale σ for MSM was varied over 30 levels, from 1.0 to 30.0 with an interval of 1.0. Note that the replacement of segments should theoretically never occur, because all of the eight derived sequences were generated from a single sine wave. In practice, some minor replacement occurred because the implementation of the algorithm required exceptional treatment at both ends of the sequences. However, since this affected all the sequences equally, we simply ignored the influence.
Figures 2–5 provide the shapes of nine test sequences: w1 to w9 . Sequence
w1 was a simple sine wave, which was also the basis for generating the other eight sequences. For reference, w1 is superimposed on each of the figures. Sequences w2 and w3 were generated by doubling and halving the amplitude of w1, respectively. These two sequences and w1 had inflection points at the same places, and their corresponding subsequences had the same lengths, phases, and gradients. Thus we used these sequences to evaluate the contribution of the rotation angle to representing differences in amplitude. Sequence w4 was generated by adding a −0.5π phase delay to w1 and was used to evaluate the contribution of the phase term. Sequences w5 and w6 were generated by adding long-term increasing and decreasing trends to w1. We used them to evaluate the contribution of the gradient term. Sequences w7 and w8 were generated by exponentially changing the amplitude of w1. They were used to test how the dissimilarity behaves for nonlinear changes of amplitude. Sequence w9 was generated by doubling the frequency and amplitude of w1. It was used to test sensitivity to compression in the time domain.
As we describe in Section 4.1, time-series medical data may contain sequences
of unequal length. Thus it becomes important to demonstrate how the dissim-
ilarity measure deals with such cases. Note that we did not directly evaluate the length term, because it was hard to create a wave in which only the length of a subsequence changes while the other factors are preserved.
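A hedged reconstruction of how such test sequences could be generated follows; the exact amplitudes, trends, and delays below are illustrative, not the values used in the paper.

import math

N = 500                                             # 500 points over 0 <= t < 6*pi
t = [6 * math.pi * k / N for k in range(N)]

w1 = [math.sin(x) for x in t]                       # base sine wave
w2 = [2 * math.sin(x) for x in t]                   # doubled amplitude
w3 = [0.5 * math.sin(x) for x in t]                 # halved amplitude
w4 = [math.sin(x - 0.5 * math.pi) for x in t]       # -0.5*pi phase delay
w5 = [math.sin(x) + 0.05 * x for x in t]            # added long-term increasing trend
w9 = [2 * math.sin(2 * x) for x in t]               # doubled frequency and amplitude

print(len(w1), round(w4[0], 3))                     # 500 -1.0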
Tables 1 and 2 provide the dissimilarity matrices obtained by applying multiscale matching and DTW, respectively, to each pair of the nine sequences. In order to evaluate the basic topological properties of the dissimilarity, we list all elements of each matrix here. From Table 1, it can be confirmed that the
(Figures 2–5: plots of the test sequences w1–w9 against t, with w1 superimposed for reference; axes y(t) versus t.)
Table 1. Dissimilarity matrix of the nine test sequences obtained by multiscale matching
w1 w2 w3 w4 w5 w6 w7 w8 w9
w1 0.000 0.574 0.455 0.644 0.541 0.611 0.591 0.662 0.782
w2 0.574 0.000 0.774 0.749 0.736 0.675 0.641 0.575 0.451
w3 0.455 0.774 0.000 0.834 0.720 0.666 0.586 0.736 0.950
w4 0.644 0.749 0.834 0.000 0.816 0.698 0.799 0.635 0.745
w5 0.541 0.736 0.720 0.816 0.000 0.917 0.723 0.624 0.843
w6 0.611 0.675 0.666 0.698 0.917 0.000 0.624 0.726 0.806
w7 0.591 0.641 0.586 0.799 0.723 0.624 0.000 1.000 0.765
w8 0.662 0.575 0.736 0.635 0.624 0.726 1.000 0.000 0.683
w9 0.782 0.451 0.950 0.745 0.843 0.806 0.765 0.683 0.000
Table 2. Dissimilarity matrix of the nine test sequences obtained by DTW
w1 w2 w3 w4 w5 w6 w7 w8 w9
w1 0.000 0.268 0.134 0.030 0.399 0.400 0.187 0.187 0.164
w2 0.268 0.000 0.480 0.283 0.447 0.445 0.224 0.224 0.033
w3 0.134 0.480 0.000 0.146 0.470 0.472 0.307 0.307 0.268
w4 0.030 0.283 0.146 0.000 0.407 0.379 0.199 0.201 0.184
w5 0.399 0.447 0.470 0.407 0.000 1.000 0.384 0.477 0.264
w6 0.400 0.445 0.472 0.379 1.000 0.000 0.450 0.411 0.262
w7 0.187 0.224 0.307 0.199 0.384 0.450 0.000 0.372 0.151
w8 0.187 0.224 0.307 0.201 0.477 0.411 0.372 0.000 0.151
w9 0.164 0.033 0.268 0.184 0.264 0.262 0.151 0.151 0.000
it is possible that a matching failure occurs. For example, a sine wave containing n periods and another sine wave containing n + 1 periods never match; thus we are basically unable to represent their dissimilarity in a reasonable manner. More generally, if there is no set of segment pairs that satisfies the complete-match criterion – the original sequence should be formed completely, without any gaps or overlaps, by concatenating the segments – within the given universe of segments, matching will fail and the triangular inequality will not be satisfied.
Comparison of w2 and w3 with w1 yielded d(w2, w3) > d(w1, w2) > d(w1, w3), meaning that a larger amplitude induced a larger dissimilarity. The order was the same as that of DTW. As mentioned above, these three sequences had completely the same factors except for the rotation angle. Thus the rotation angle successfully captured the difference in amplitude as a difference of shape. Comparison of w4 with w1 showed that the difference in phase was successfully translated into the dissimilarity. Taking w2 and w3 into account, one may argue that the order of dissimilarities does not follow the order produced by DTW. We consider that this occurred because, due to the phase shift, the shape of w4 at the edges became different from those of w1, w2 and w3. Therefore matching was performed with replacement of segments at a higher scale, and the cost of segment replacement might have been added to the dissimilarity.
Comparison of w5 and w6 with w1 yielded d(w5, w6) >> d(w1, w6) > d(w1, w5). This shows that the dissimilarity between w5 and w6, which have completely different trends, was far larger than their dissimilarity to w1, which has a flat trend. Dissimilarities d(w1, w5) and d(w1, w6) were not the same due to the characteristics of the test sequences; since w5 had an increasing trend and first diverged toward the vertical origin, it reached the first inflection point by a shorter route than w6 did. The order of dissimilarities was the same as that of DTW.
Comparison of w7 and w8 with w1 yielded d(w7, w8) >> d(w1, w7) > d(w1, w8), meaning that the difference in amplitude was accumulated and a far larger dissimilarity was assigned to d(w7, w8) than to d(w1, w7) or d(w1, w8). Finally, comparison of w9 with w1 and w2 yielded d(w1, w9) >> d(w1, w2). This shows that the change of sharpness was captured by the rotation angle. DTW produced quite small dissimilarities between w1 and w9 because of its one-to-many matching property.
4.1 Materials
We employed the chronic hepatitis dataset [7], which was provided as a common dataset for the ECML/PKDD Discovery Challenge 2002 and 2003. The dataset contained long time-series data on laboratory examinations, which were collected at Chiba University Hospital in Japan. The subjects were 771 patients with hepatitis B and C who received examinations between 1982 and 2001. We manually removed the sequences of 268 patients because biopsy information was not provided for them and thus their virus types were not clearly specified. According to the biopsy information, the expected constitution of the remaining 503 patients was B / C-noIFN / C-IFN = 206 / 100 / 197, where B, C-noIFN, and C-IFN respectively represent patients with type B hepatitis, patients with type C hepatitis who did not receive interferon therapy, and patients with type C hepatitis who received interferon therapy. Due to missing examinations, the number of available sequences could be less than 503. The dataset contained a total of
983 laboratory examinations. In order to simplify our experiments, we selected
13 blood tests that are relevant to the liver function: ALB, ALP, G-GL, G-GTP,
GOT, GPT, HGB, LDH, PLT, RBC, T-BIL, T-CHO and TTT. Details of each
examination are available at the URL [7].
Each sequence originally had a different sampling interval, from one day to several years. Figure 6 provides an example, showing a histogram of the sampling intervals of PLT sequences. From a preliminary analysis we found that the majority of intervals ranged from one week to one month, and that most of the patients received examinations on a fixed day of the week. According to this observation, we
Fig. 6. Histogram of the average sampling intervals of PLT sequences
Fig. 7. Histogram of the number of data points in resampled PLT sequences
Table 3. Comparison of the number of generated clusters. Each item represents clusters
for Hepatitis B / C-noIFN / C-IFN cases
Fig. 10. Dendrograms for the GPT sequences of type B hepatitis patients. Comparison
Method=DTW. Grouping method = AL-AHC (left), CL-AHC (right)
noIFN sequences were grouped into 3 clusters, and 196 C-IFN sequences were
also grouped into 3 clusters.
DTW and AHCs. Let us first investigate the case of DTW-AHC. Comparison of DTW-AL-AHC and DTW-CL-AHC implies that the results can differ depending on the linkage criterion. The left image of Figure 10 shows a dendrogram generated from the GPT sequences of type B hepatitis patients using DTW-AL-AHC. It can be observed that the dendrogram of AL-AHC has an ill-formed, 'chaining' structure, which is usually observed with single-linkage AHC. For such an ill-formed structure, it is difficult to find a good point at which to terminate the merging of clusters. In this case, the method produced three clusters containing 193, 9 and 1 sequences, respectively. Almost all types of sequences were included in the first, largest cluster, and thus no interesting information was obtained. On the contrary, the dendrogram of CL-AHC, shown on the right of Figure 10, demonstrates a well-formed hierarchy of the sequences. With this dendrogram the method produced 7 clusters containing 27, 21, 52, 57, 43, 2, and 1 sequences each.
Figure 11 shows examples of the sequences grouped into the first four clusters. One can observe interesting features for each cluster. The first cluster contains sequences that involve continuous vibration of the GPT values. These patterns may imply that the virus continues to attack the patient's body periodically. The second cluster contains very short, meaningless sequences, which may represent cases in which patients quickly stopped or canceled treatment. The third cluster contains another interesting pattern: vibration followed by flat, low values. There are two possible cases that produce this pattern. The first case is that the patients were cured, by some treatment or naturally. The second case is that the patients entered the terminal stage and there remains nothing that produces GPT. It is difficult to know which of the two cases a sequence belongs
Fig. 11. Examples of the clustered GPT sequences for type B hepatitis patients. Comparison Method=DTW. Grouping method=CL-AHC. Top Left: 1st cluster (27 cases), Top Right: 2nd cluster (21 cases), Bottom Left: 3rd cluster (52 cases), Bottom Right: 4th cluster (57 cases). Sequences are selected according to MID order
to, until other types of information, such as the PLT and CHE levels, are taken into account. The fourth cluster contains sequences that have a mostly triangular shape.
DTW and RC. For the same data, the rough set-based clustering method produced 55 clusters. Fifty-five clusters is too many for 204 objects; however, 41 of the 55 clusters contained fewer than 3 sequences, and 31 of them contained only one sequence. This is because rough set-based clustering tends to produce independent, small clusters for objects that are intermediate between the large clusters. Ignoring the small ones, we obtained a total of 14 clusters containing 53, 16, 10, 9, 9 . . . objects each. The largest cluster contained short sequences, quite similarly to the case of DTW-CL-AHC. Figure 12 shows examples of sequences for the 2nd, 3rd, 4th and 5th major clusters. Because this method evaluates the indiscernibility degree of objects, each of the generated clusters contains strongly similar sets of sequences. Although the populations of the clusters are not so large, one can clearly observe representatives of the in-
Fig. 12. Examples of the clustered GPT sequences for type B hepatitis patients. Comparison Method=DTW. Grouping method=RC. Top Left: 2nd cluster (16 cases), Top Right: 3rd cluster (10 cases), Bottom Left: 4th cluster (9 cases), Bottom Right: 5th cluster (9 cases)
teresting patterns described previously for CL-AHC. For example, the 2nd and 4th clusters contain cases with large vibration; of these, the latter mostly contains cases in which the vibration is followed by a decreased level. The 3rd cluster contains clear cases of vibration followed by flattened patterns. Sequences in the 5th cluster show the obvious feature of flat patterns without large vibration.
Fig. 13. Dendrograms for the GPT sequences of type C hepatitis patients with IFN
therapy. Comparison Method=MSM. Grouping method = AL-AHC (left), CL-AHC
(right)
For example, suppose we have two sequences: one a short sequence containing only one segment, and the other a long sequence containing hundreds of segments.
The segments of the latter sequence will not be integrated into one segment until
the scale becomes considerably high. If the range of scales we use does not cover
such a high scale, the two sequences will never be matched. In this case, the
method should return infinite dissimilarity, or a special number that identifies
the failed matching.
This property prevents AHCs from working correctly. CL-AHC will never
merge two clusters if any pair of ’no-match’ sequences exist between them. AL-
AHC fails to calculate average dissimilarity between two clusters. Figure 13
provides dendrograms for GPT sequences of Hepatitis C (with IFN) patients
obtained by using multiscale matching and AHCs. In this experiment, we set the dissimilarity of 'no-match' pairs equal to that of the most dissimilar 'matched' pair in order to avoid computational problems. The dendrogram of AL-AHC is
compressed to the small-dissimilarity side because there are several pairs that
have excessively large dissimilarities. The dendrogram of CL-AHC demonstrates
that the ’no-match’ pairs will not be merged until the end of the merging process.
For AL-AHC, the method produced 8 clusters. However, similarly to the pre-
vious case, most of the sequences (182/196) were included in the same, largest
cluster and no interesting information was found therein. For CL-AHC, the
method produced 16 clusters containing 71, 39, 28, 12 . . . sequences. Figure
14 provides examples of the sequences grouped into the four major clusters,
respectively. Similar sequences were found in the clusters; however, obviously dissimilar sequences were also observed within the same clusters.
Fig. 14. Examples of the clustered GPT sequences for type C hepatitis patients with IFN therapy. Comparison Method=MSM. Grouping method=CL-AHC. Top Left: 1st cluster (71 cases), Top Right: 2nd cluster (39 cases), Bottom Left: 3rd cluster (28 cases), Bottom Right: 4th cluster (12 cases). Sequences are selected according to MID order
Figure 15 provides examples of the sequences grouped into the four major clusters. It can be observed that the sequences were properly clustered into the three major patterns: continuous, large vibration (1st cluster), flat after vibration (2nd cluster), and short (3rd cluster). Although the 4th cluster seems to be intermediate between them and difficult to interpret, the other results demonstrate the ability of the clustering method to handle relative proximity, which enables us to avoid the problem of the no-match cases that occur with MSM.
5 Conclusions
In this paper we have reported a comparative study about the characteristics
of time-series comparison and clustering methods. First, we examined the basic
characteristics of MSM and DTW using a simple sine wave and its variants.
Fig. 15. Examples of the clustered GPT sequences for type C hepatitis patients with IFN therapy. Comparison Method=MSM. Grouping method=RC. Top Left: 1st cluster (80 cases), Top Right: 2nd cluster (60 cases), Bottom Left: 3rd cluster (18 cases), Bottom Right: 4th cluster (6 cases). Sequences are selected according to MID order
method. In the future we will extend this study to other types of databases and other types of clustering and comparison methods.
Acknowledgments
This work was supported in part by the Grant-in-Aid for Scientific Research on
Priority Area (2) “Development of the Active Mining System in Medicine Based
on Rough Sets” (No. 13131208) by the Ministry of Education, Culture, Science
and Technology of Japan.
References
1. E. Keogh (2001): Mining and Indexing Time Series Data. Tutorial at the 2001
IEEE International Conference on Data Mining.
2. Chu, S., Keogh, E., Hart, D., Pazzani, M. (2002). Iterative Deepening Dynamic
Time Warping for Time Series. In proceedings of the second SIAM International
Conference on Data Mining.
3. D. Sankoff and J. Kruskal (1999): Time Warps, String Edits, and Macromolecules.
CSLI Publications.
4. S. Hirano and S. Tsumoto (2002): Mining Similar Temporal Patterns in Long
Time-series Data and Its Application to Medicine. Proceedings of the IEEE 2002
International Conference on Data Mining: pp. 219–226.
5. N. Ueda and S. Suzuki (1990): A Matching Algorithm of Deformed Planar Curves
Using Multiscale Convex/Concave Structures. IEICE Transactions on Information
and Systems, J73-D-II(7): 992–1000.
6. F. Mokhtarian and A. K. Mackworth (1986): Scale-based Description and Recogni-
tion of planar Curves and Two Dimensional Shapes. IEEE Transactions on Pattern
Analysis and Machine Intelligence, PAMI-8(1): 24-43
7. URL: http://lisp.vse.cz/challenge/ecmlpkdd2003/
8. B. S. Everitt, S. Landau, and M. Leese (2001): Cluster Analysis Fourth Edition.
Arnold Publishers.
9. S. Hirano and S. Tsumoto (2003): An Indiscernibility-based Clustering Method
with Iterative Refinement of Equivalence Relations - Rough Clustering -. Journal
of Advanced Computational Intelligence and Intelligent Informatics, 7(2):169–177
10. Lowe, D.G (1980): Organization of Smooth Image Curves at Multiple Scales. In-
ternational Journal of Computer Vision, 3:119–130.
11. Z. Pawlak (1991): Rough Sets, Theoretical Aspects of Reasoning about Data.
Kluwer Academic Publishers, Dordrecht.
12. A. P. Witkin (1983): Scale-space Filtering. Proceedings of the Eighth IJCAI,
pp. 1019–1022.
Spiral Mining Using Attributes from
3D Molecular Structures
1 Introduction
The importance of structure-activity relationship (SAR) studies relating chemical
structures and biological activity is well recognized. Early studies used statistical
techniques, and concentrated on establishing quantitative structure activity
relationships involving compounds sharing a common skeleton. However, it is more
natural to treat a variety of structures together, and to identify the characteristic
substructures responsible for a given biological activity. Recent innovations in high-
throughput screening technology have produced vast quantities of SAR data, and
there is an increasing need for new data mining methods to facilitate drug
development.
Several SARs [3, 4] have been analyzed using the cascade model that we
developed [5]. More recently the importance of a “datascape survey” in the mining
process was emphasized in order to obtain more valuable insights. We added several
functions to the mining software of the cascade model to facilitate the datascape
survey [6, 7].
This new method was recently used in a preliminary study of the SAR of the
antagonist activity of dopamine receptors [8]. The resulting interpretations of the rules
were highly regarded by experts in drug design, as they were able to identify some
characteristic substructures that had not been recognized previously. However, the
interpretation process required the visual inspection of supporting chemical structures
as an essential step. Therefore, the user had to be very careful not to miss
characteristic substructures. The resulting information was insufficient for drug
design, as the results often provided fragments so simple that they appear in
multitudes of compounds. In order to overcome this difficulty, we incorporated
further functions in the cascade model, and we tried to improve the expressions for
linear fragments.
Fruitful mining results will never be obtained without a response from active users.
This paper reports an attempt to reflect expert ideas in the process of attribute
generation and selection. These processes really constitute spiral mining, in which
chemists and system developers conduct analyses alternately. We briefly describe the
aims of mining and introduce the mining method used. Fragment generation is
described in Section 3, and attempts to select attributes from fragments are reported in
Section 4. Typical rules and their interpretations for dopamine D1 agonists are given
in Section 5.
2.1 Aims and Data Source for the Dopamine Antagonists Analysis
Dopamine is a neurotransmitter in the brain. Neural signals are transmitted via the
interaction between dopamine and proteins known as dopamine receptors. There are
six different receptor proteins, D1 – D5, and the dopamine autoreceptor (Dauto), each
of which has a different biological function. Their amino acid sequences are known,
but their 3D structures have not been established.
Certain chemicals act as agonists or antagonists for these receptors. An agonist
binds to the receptor, and it in turn stimulates a neuron. Conversely, an antagonist
binds to the receptor, but its function is to occupy the binding site and to block the
neurotransmitter function of a dopamine molecule. Antagonists for these receptors
might be used to treat schizophrenia or Parkinson’s disease. The structural
characterization of these agonists and antagonists is an important step in developing
new drugs.
We used the MDDR database (version 2001.1) of MDL Inc. as the data source. It
contains about 120,000 drug records, including 400 records that describe dopamine
(D1, D2, and Dauto) agonist activities and 1,349 records that describe dopamine (D1,
D2, D3, and D4) antagonist activities. Some of the compounds affect multiple
receptors. Some of the compounds contain salts, and these parts were omitted from
the structural formulae. The problem is to discover the structural characteristics
responsible for each type of activity.
The cascade model can be considered an extension of association rule mining [5]. The
method creates an itemset lattice in which an [attribute: value] pair is used as an item
to constitute itemsets. Links in the lattice are selected and interpreted as rules. That is,
we observe the distribution of the RHS (right hand side) attribute values along all
links, and if a distinct change in the distribution appears along some link, then we
focus on the two terminal nodes of the link. Consider that the itemset at the upper end
of a link is [A: y] and item [B: n] is added along the link. If a marked activity change
occurs along this link, we can write the rule:
Cases: 200 ==> 50 BSS=12.5
IF [B: n] added on [A: y]
THEN [Activity]: .80 .20 ==> .30 .70 (y n)
THEN [C]: .50 .50 ==> .94 .06 (y n)
Ridge [A: n]: .70 .30/100 ==> .70 .30/50 (y n)
where the added item [B: n] is the main condition of the rule, and the items at the
upper end of the link ([A: y]) are considered preconditions. The main condition
changes the ratio of the active compounds from 0.8 to 0.3, while the number of
supporting instances decreases from 200 to 50. BSS means the between-groups sum of
squares, which is derived from the decomposition of the sum of squares for a
categorical variable. Its value can be used as a measure of the strength of a rule. The
second “THEN” clause indicates that the distribution of the values of attribute [C]
also changes sharply with the application of the main condition. This description is
called the collateral correlation, which plays a very important role in the
interpretation of the rules.
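For the distributions of a categorical attribute, one common form of this decomposition, which reproduces the value BSS = 12.5 in the sample rule above, is half the lower-node case count times the squared change of the distribution; the helper below is our illustration, not code from DISCAS.

def bss(n_lower, p_upper, p_lower):
    """p_upper, p_lower: class probability distributions as equal-length lists."""
    return 0.5 * n_lower * sum((q - p) ** 2 for p, q in zip(p_upper, p_lower))

# Sample rule above: 200 ==> 50 cases, [Activity] .80 .20 ==> .30 .70
print(bss(50, [0.80, 0.20], [0.30, 0.70]))   # 12.5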
Three new points were added to DISCAS, the mining software for the cascade model.
The main subject of the first two points is decreasing the number of resulting rules
[6]. A rule candidate link found in the lattice is first greedily optimized in order to
give the rule with the local maximum BSS value, changing the main and pre-
conditions. Let us consider two candidate links: (M added on P) and (M added on P').
Here, their main conditions, M, are the same. If the difference between preconditions
P and P' is the presence/absence of one precondition clause, the rules starting from
these links converge on the same rule expression, which is useful for decreasing the
number of resulting rules.
The second point is the facility to organize rules into principal and relative rules. In
the association rule system, a pair of rules, R and R', are always considered
independent entities, even if their supporting instances overlap completely. We think
that these rules show two different aspects of a single phenomenon. Therefore, a
group of rules sharing a considerable number of supporting instances is expressed as a principal rule, the one with the largest BSS value, together with its relative rules. This function is useful for decreasing the number of principal rules to be inspected and for indicating the relationships among rules. Further, we omit a rule if most of its supporting instances are covered by the principal rule and if its distribution change in the activity is smaller
The last point provides ridge information for a rule [7]. The last line in the
aforementioned rule contains ridge information. This example describes [A: n], the
ridge region detected, and the change in the distribution of “Activity” in this region.
Compared with the large change in the activity distribution for the instances with [A:
y], the distribution does not change on this ridge. This means that the BSS value
decreases sharply if we expand the rule region to include this ridge region. This ridge
information is expected to guide the survey of the datascape.
Figure 1 shows a brief scheme of the analysis. The structural formulae of all
chemicals are stored in the SDF file format. We used two kinds of explanatory attributes generated from the structural formulae of chemical compounds: physicochemical properties and the presence/absence of many linear fragments. We used four physicochemical estimates: the HOMO and LUMO energy levels, the dipole moment, and LogP. The first three values were estimated by molecular mechanics and molecular orbital calculations with the MM-AM1-Geo method provided in CAChe. LogP values were calculated using the program ClogP in ChemOffice. The scheme used for fragment generation is discussed in the following
section.
The number of fragments generated is huge, and some of them are selected to form
the data table shown in the center of Fig. 1. The cascade model is applied to analyze
data, and the resulting rules are interpreted. Often, chemists will browse the rules and
try to make hypotheses, but they rarely reach a reasonable hypothesis from the rule
expression alone. Spotfire software can then be used, as it can visualize the structural
formulae of the supporting compounds when the user specifies the rule conditions.
Figure 2 shows a sample window from Spotfire. Here, the attributes appearing in a
rule are used as the axes of pie and bar charts, and the structural formulae with the
specified attribute values are depicted on the left side by clicking a chart. A chemist
might then find meaningful substructures from a focused set of compound structures.
This function was found to be essential in order to evoke active responses from a
chemist.
The use of linear fragments as attributes has been described in a previous paper [3].
Klopman originally proposed this kind of scheme [1]. Kramer developed a similar
kind of scheme independently of our work [2].
Figure 3 shows an example of a structural formula and a set of linear fragment
patterns derived from it. Hydrogen atoms are regarded as attachments to heavy atoms.
Here, every pattern consists of two terminal atoms and a connecting part along the
shortest path. For example, the pattern at the bottom of the left column uses atoms
<1> and <6> as terminals, and the connecting part is described by “=C–C–C–”, which
shows the sequence of bond types and element symbols along the path, <1>=<2>–
<3>–<5>–<6>. An aromatic bond is denoted as “r”. The description of a terminal
atom includes the coordination number (number of adjacent atoms), as well as
whether there are attached hydrogen atoms.
In this example, we require that at least one of the terminal atoms be a heteroatom
or an unsaturated carbon atom. Therefore, in Fig. 3, no fragment appears between
tetrahedral carbon atoms <3> and <5>. Fragments consisting of a single heavy atom
like C3H and O2H have been added to the items, although they are not shown in the
figure. The set of these fragments can be regarded as constituting a kind of fingerprint
of the molecule.
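The following sketch illustrates the basic fragment-generation scheme. The molecule representation (atom records with element symbol, coordination number, and attached-hydrogen count; a bond-symbol table keyed by atom pairs; an adjacency list) is a hypothetical stand-in for the actual implementation; shortest paths are found by breadth-first search.

from collections import deque

def shortest_path(adj, s, t):
    # breadth-first search returning the atom indices on a shortest path from s to t
    prev = {s: None}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    path = []
    while t is not None:
        path.append(t)
        t = prev[t]
    return path[::-1]

def terminal(atom):
    # terminal atom description: element, coordination number, "H" if hydrogens attached
    return atom["elem"] + str(atom["coord"]) + ("H" if atom["nH"] else "")

def fragment(atoms, bonds, adj, i, j):
    # bonds: dict mapping frozenset((a, b)) -> bond type symbol ("-", "=", ":", ...)
    path = shortest_path(adj, i, j)
    s = terminal(atoms[path[0]])
    for a, b in zip(path, path[1:]):
        s += bonds[frozenset((a, b))]
        s += atoms[b]["elem"] if b != path[-1] else terminal(atoms[b])
    return s

In the actual scheme, the pair of terminal atoms would range over all atom pairs in which at least one terminal is a heteroatom or an unsaturated carbon, as stated above.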
The fragment patterns in Fig. 3 are just examples. The coordination number and
hydrogen attachments in the terminal atom part may be omitted. The connecting part
is also subject to change. For example, we can just use the number of intervening
bonds as the connecting part. Conversely, we can add coordination numbers or the
hydrogen attachment to the atom descriptions in the connecting part. There is no a
priori criterion to judge the quality of various fragment expressions. Only after the
discovery process can we judge which type of fragment expression is useful.
This fragment expression led to the discovery of new insights in the application to
mutagenesis data. However, a preliminary examination of the rules resulting from
dopamine antagonists was not as promising. For example, an aromatic ether
substructure was found to be effective in D2 antagonist activity. This substructure
appears in many compounds, and chemists concluded that the rules were not useful.
Further, they suggested several points to improve the readability of the rules.
Therefore, we decided to incorporate new features into the expression of fragments,
as discussed in the following sections.
[Three example structures in which substituents A and B are separated by four intervening atoms in different molecular environments]
When we describe all the element symbols in the fragment, the number of supporting compounds for a fragment decreases, and none of the three fragments might be used as an attribute. If we express the connecting part using the number of intervening atoms, then all these structures contain the fragment “A-<4>-B”. However, chemists rejected this expression, as it does not evoke structural formulae in their minds. Therefore, we tried to introduce a new mechanism that omits element and bond symbols from fragment expressions, while keeping the image of the structural formula. The proposed fragment expression takes the following forms when applied to the above three structures.
[Fig. 4: a sample structure in which a phenolic O–H forms a hydrogen bond to an aromatic nitrogen]
We judge the hydrogen bond XH…Y to exist when the following conditions are all
satisfied.
1. Atom X is O, N, S or a 4-coordinated C with at least one hydrogen atom.
2. Atom Y is O, N, S, F, Cl or Br.
3. The distance between X and Y is less than 3.7 Å if Y is O, N or F; and it is less
than 4.2 Å otherwise.
4. The structural formula does not contain fragments X-Y or X-Z-Y, including any
bond type between atoms.
When these conditions are satisfied, we generate fragments: Xh.Y, V-Xh.Y, Xh.Y-
W, and V-Xh.Y-W, where “h.” denotes a hydrogen bond and neighboring atoms V
and W are included. Other notations follow the basic scheme. The fragments derived
from the sample structure in Fig. 4 are as follows.
O2Hh.n2 , c3-O2Hh.n2 , O2Hh.n2:c3 , O2Hh.n2:c3H ,
c3-O2Hh.n2:c3, c3-O2Hh.n2:c3H
where we use the notation for the aromatic ring explained in the next section.
Application to the dopamine antagonist dataset resulted in 431 fragments, but the
probability of the most frequent fragment was less than 0.1. Therefore, no hydrogen
bond fragments were used in the standard attribute selection process.
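The four conditions above can be sketched as follows; the atom records (element, coordination number, attached hydrogens, 3D coordinates from the calculated geometry) and the graph_distance helper are hypothetical stand-ins for the real data structures.

import math

DONORS = {"O", "N", "S"}
ACCEPTORS = {"O", "N", "S", "F", "Cl", "Br"}

def is_hydrogen_bond(x, y, graph_distance):
    # condition 1: X is O, N, S, or a 4-coordinated C, with at least one attached H
    if not (x["elem"] in DONORS or (x["elem"] == "C" and x["coord"] == 4)):
        return False
    if x["nH"] == 0:
        return False
    # condition 2: Y is O, N, S, F, Cl or Br
    if y["elem"] not in ACCEPTORS:
        return False
    # condition 3: distance cutoff depends on the acceptor element
    cutoff = 3.7 if y["elem"] in {"O", "N", "F"} else 4.2
    if math.dist(x["xyz"], y["xyz"]) >= cutoff:
        return False
    # condition 4: X and Y must not be bonded directly or through one atom
    return graph_distance(x["idx"], y["idx"]) > 2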
So far, the element symbol of all carbon atoms has been expressed by the capital letter
“C”. Consider molecule I, shown below. Its fragments contain C3-C3-N3H and C3-
C3=O1, and we cannot judge whether these C3 atoms are in an aromatic ring or are
part of a double bond. It is important to know whether an atom is in an aromatic ring.
Therefore, we introduced a mechanism to indicate aromatic rings; we describe
aromatic atoms in lowercase letters. The above fragments are now written c3-C3-N3H
and c3-C3=O1, so chemists can easily recognize the molecular environment of the
fragment.
[Structural formula of molecule I, containing an amide group attached to an aromatic ring]
Further, we changed the bond notation for an aromatic ring from “r” to “:”, as the
latter is used in SMILES notation, a well-known chemical structure notation system.
If these fragments appear in a rule, chemists will wonder whether they constitute an
amide group like I, or if they appear in separate positions in the molecule.
In order to overcome this difficulty, we introduced a united atom symbol “CO” for
a carbonyl group with two neighboring atoms. Then, the above fragment can be
written as c3-CO2-N3H. Chemists felt that this expression improves the readability of
rules greatly. A similar notation was used in the CASE system adopted by Klopman
[1].
bond fragments for which the probability of appearance was > 0.02, as well as five
other fragments: N3-c3:c3-O2, N3H-c3:c3-O2, N3-c3:c3-O2H, N3H-c3:c3-O2H,
O1=S4.
The strongest rule indicating active compounds takes the following form, unless
we add 32 fragments as attributes.
IF [C4H-C4H-O2: y] added on [ ]
THEN D2AN: 0.32 ==> 0.62 (on)
THEN c3-O2: 0.42 ==> 0.89 (y)
THEN c3H:c3-O2: 0.33 ==> 0.70 (y)
Ridge [c3H:c3H::c3-N3: y] 0.49/246 --> 0.92/71 (on)
No preconditions appear, and the main condition shows that an oxygen atom bonded
to an alkyl carbon is important. However, this finding is so different from the common sense of chemists that it will never be accepted as a useful suggestion. In
fact, the ratio of active compounds is only 62%. Collateral correlations suggest that
the oxygen atom is representative of aromatic ethers, and the ridge information
indicates the relevance of aromatic amines. Nevertheless, it is still difficult for an
expert to make a reasonable hypothesis.
Chemists found a group of compounds sharing the skeleton shown below, on
browsing the supporting structures. Therefore, they added fragments consisting of two
aromatic carbons bonded to N3 and O2. This addition did not change the strongest
rule, but the following new relative rule appeared.
[Skeleton shared by the supporting compounds: aromatic carbons bearing O and N substituents, linked to an aryl group]
IF [N3-c3:c3-O2: y] added on [ ]
THEN D2AN: 0.31 ==> 0.83 (on)
THEN HOMO: .16 .51 .33 ==> .00 .19 .81 (low medium high)
THEN c3H:c3---C4H-N3: 0.24 --> 0.83 (y)
This rule has a greater accuracy and it explains about 20% of the active
compounds. The tendency observed in the HOMO value also gives us a useful insight.
However, the collateral correlation on the last line suggests that most compounds
supporting this rule have the skeleton shown above. Therefore, we cannot exclude the
possibility that other parts of this skeleton are responsible for the activity.
We changed the direction of the mining scheme so that it used more attributes in
mining to remove the burden placed on chemists in the first attempt. A correlation-
based approach for attribute selection was introduced. This new scheme enabled the
use of several hundred fragments as attributes. The details of the attribute-selection
scheme will be published separately [9].
When we applied the new scheme to agonist data, 4,626 fragments were generated.
We omitted a fragment from the attribute set unless the probability of its appearance satisfied the condition 0.03 < P(fragment) < 0.97. This gave 660 fragments as candidate attributes. Then, for each correlated pair whose correlation coefficient was greater than 0.9, we omitted one of the two fragments; the fragment with the longer string was the one removed.
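A minimal sketch of this attribute-selection step is given below, assuming the fragment occurrences are held as a 0/1 pandas DataFrame with compounds as rows and fragments as columns; the tie-breaking details of the actual system are not known to us.

import pandas as pd

def select_fragments(table: pd.DataFrame, p_lo=0.03, p_hi=0.97, r_max=0.9):
    # keep fragments whose probability of appearance lies inside (p_lo, p_hi)
    p = table.mean()
    table = table.loc[:, (p > p_lo) & (p < p_hi)]
    # from each highly correlated pair, drop the fragment with the longer string
    corr = table.corr().abs()
    kept = []
    for col in sorted(table.columns, key=len):   # shorter strings examined first
        if all(corr.loc[col, k] <= r_max for k in kept):
            kept.append(col)
    return table[kept]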
We used 345 fragments and four physicochemical properties to construct the itemset
lattice. The thres parameter, which controls the detail of the lattice search [5], was set to 0.125. The resulting lattice contained 9,508 nodes, and we selected 1,762
links (BSS > 2.583 = 0.007*#compounds) as rule candidates in the D1 agonist
analysis. Greedy Optimization of these links resulted in 407 rules (BSS > 5.535 =
0.015*#compounds). Organization of these rules gave us 14 principal rules and 53
relative rules. Many rules indicated characteristics leading to inactive compounds or
had few supporting compounds, and we omitted those rules with an activity ratio <
0.8 and those with #compounds < 10 after applying the main condition. The final
number of rules we inspected decreased to two principal and 14 relative rules.
We inspected all the rules in the final rule set, browsing the supporting chemical
structures using Spotfire. Table 1 summarizes the important features of valuable rules
that guided us to characteristic substructures for D1 agonist activity.
R1 was the strongest rule derived in the D1 agonist study. There are no
preconditions, and the activity ratio increases from 17% in 369 compounds to 96% in
52 compounds by including the catechol structure (O2H-c3:c3-O2H). This main
condition reduces Dauto activity to 0%. Other collateral correlations suggest the
existence of N3H-C4H-C4H-c3, and OH groups are thought to exist at the positions
meta and para to this ethylamine substituent. However, this ethylamine group is not
an indispensable substructure, as the appearance of the N3H-C4H-C4H-c3 fragment is
limited to 81% of the supporting compounds.
This observation is also supported by the ridge information. That is, the selection
of a region inside the rule by adding a condition [N3H-C4H--:::c3-O2H: y] results in
53 (35 active, 18 inactive) and 35 (34 active, 1 inactive) compounds before and after
the application of the main condition, respectively. This means that there are 1 active
and 17 inactive compounds when the catechol structure does not exist, and we can say
that N3H-C4H--:::c3-O2H cannot show D1Ag activity without catechol. Therefore,
we can hypothesize that for D1Ag activity catechol is the active site and it is
supported by a primary or secondary ethylamine substituent at the meta and para
positions as the binding site.
[Skeleton D1-A: a catechol ring bearing an ethylamine substituent]
Fourteen relative rules are associated with principal rule R1. Some of them use
[N3H-C4H--::c3-O2H: y] and [O2H-c3:c3: y] as the main condition with a variety of
preconditions. Another relative rule has the main condition [N3: n] and preconditions
[C4H-O2-c3: n] and [C4H-C4H----:::c3-O2H: n]; characterization depending on the
[Skeleton D1-B]
The only exceptional relative rule was R1-UL9, which is shown as the second entry
in Table 1. The interesting points of this rule are the 100% co-appearance of the D2
agonist activity, as well as the tert-amine structure in the main condition. These points
make a sharp contrast to those found in R1, where prim- and sec-amines aid the
appearance of D1Ag activity and D2Ag activity was found in 38% of the 52
compounds. The importance of the prim- or sec-amines, ethylamine substituent, and
catechol structure are also suggested by the precondition and collateral correlations.
Inspection of the supporting structures showed that this rule was derived from
compounds with skeleton D1-C. We found a dopamine structure around the phenyl
ring at the right in some compounds, but it could not explain the D1Ag activity for all
the supporting compounds. Therefore, we proposed a new hypothesis that the active
site is the catechol in the left ring, while the binding site is the sec-amine in the
middle of the long chain. This sec-amine can locate itself close to the catechol ring by
folding the (CH2)n (n = 6, 8) chain, and it will act as the ethylamine substituent
in D1-A.
[Skeleton D1-C: a catechol ring connected through a (CH2)n chain (n = 6, 8) to a secondary amine, which is connected through a further (CH2)m chain (m = 2, 3, 4) to an N-phenyl group]
R14 is the second and final principal rule leading to D1Ag activity. Contrary to the
R1 group rules, no OH substituted on an aromatic ring played an essential role in this
rule. It was difficult to interpret this rule, because the main condition and second
precondition are designated by the absence of ether and tert-amine substructures.
However, we found that 6 out of 11 compounds share the skeleton D1-D, where the
vicinal OHs in catechol are transformed into esters. These esters are thought to be
hydrolyzed to OH in the absorption process, and these compounds act as pro-drugs to
provide the D1-A skeleton.
[Skeleton D1-D: the D1-A skeleton with the vicinal catechol OHs esterified]
6 Conclusion
The development of the cascade model along with improvements in linear fragment
expressions led to the successful discovery of the characteristic substructure of D1
agonists. The proposed hypothesis bears a close resemblance to the dopamine
molecule, and is not a remarkable one. Nevertheless, the hypotheses are rational and
they have not been published elsewhere. Other agonists and antagonists are now being
analyzed.
Close collaboration between chemists and system developers has been the key to
our final success. In this project, the chemist had expertise in using computers in SAR
study, while the system developer had a previous career in chemistry. These
backgrounds enabled efficient, effective communication between the members of the
research group. The chemist would briefly suggest points to improve, and the
developer could grasp the meaning without detailed, reasoned explanations from the
chemist. The developer often tried to analyze the data in the preliminary analysis, and
sometimes he could judge the shortcomings of his system before consulting the
chemist. Therefore, the number of spirals in the system development process cannot
be estimated easily. Such collaboration is not always possible, but we should note that
the communication and trust among the persons involved might be a key factor in the
success in this sort of mining.
Acknowledgements
The authors thank Ms. Naomi Kamiguchi of Takeda Chemical Industries for her
preliminary analysis on dopamine activity data. This research was partially supported
by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Scientific
Research on Priority Areas, 13131210 and by Grant-in-Aid for Scientific Research
(A) 14208032.
References
1 Klopman, G.: Artificial Intelligence Approach to Structure-Activity Studies. J. Amer.
Chem. Soc. 106 (1985) 7315-7321
2 Kramer, S., De Raedt, L., Helma, C.: Molecular feature mining in HIV data. In: Proc. of the
Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining (KDD-01) (2001) 136-143
3 Okada, T.: Discovery of Structure Activity Relationships using the Cascade Model: the
Mutagenicity of Aromatic Nitro Compounds. J. Computer Aided Chemistry 2 (2001) 79-86
Classification of Pharmacological Activity of Drugs Using Support Vector Machine
Y. Takahashi, K. Nishikoori, and S. Fujishima

1 Introduction
For half a century, many efforts have been made to develop new drugs. It is true that such new drugs allow us to lead better lives. However, serious side effects have often been reported, raising social problems. The aim of this research project is to establish a basis for computer-aided risk reports on chemicals based on pattern recognition techniques and chemical similarity analysis.
The authors [1] proposed the Topological Fragment Spectra (TFS) method for a numerical vector representation of the topological structure profile of a molecule. The TFS provides us with a useful tool to evaluate structural similarity among molecules [2]. In our preceding work [3], we reported that an artificial neural network approach combined with TFS input signals allowed us to successfully classify the types of activity of dopamine receptor antagonists, and that it could be applied to the prediction of the activity class of unknown compounds. We also suggested that similar-structure searching based on the TFS representation of molecules could provide a chance to discover new insights or knowledge from a huge amount of data [4].
On the other hand, in the past few years, support vector machines (SVMs) have attracted great interest in machine learning due to their superior generalization ability in various learning problems [5-7]. The support vector machine originates from perceptron theory, but seldom suffers from classical problems of artificial neural networks such as multiple local minima, the curse of dimensionality, and over-fitting. Here we investigate the utility of the support vector machine combined with the TFS representation of chemical structures in classifying pharmacological drug activities.
2 Methods
[Fig. 1 (schematic): subgraphs with S(e) = 1, 2, and 3 edges are enumerated from a small example graph, each subgraph is characterized by the sum of the degrees of its nodes, and the resulting frequency histogram gives the vector X = (3, 1, 2, 2, 2, 3, 3, 1) over characterization values 1 to 8.]
Fig. 1. A schematic flow of TFS generation. S(e) is the number of edges (bonds) of generated
fragments
To get the TFS representation of a chemical structure, all the possible subgraphs with a specified number of edges are enumerated. Subsequently, every subgraph is characterized with a numerical quantity defined by a characterization scheme chosen in advance. In Fig. 1 the sum of the degrees of the vertex atoms is used as the characterization index of each subgraph. Alternative methods can also be employed for the characterization, such as the sum of the mass numbers of the atoms (atomic groups) corresponding to the vertices of the subgraph. The TFS is defined as the histogram obtained from the frequency distribution of the individually characterized subgraphs (i.e., substructures or structural fragments) according to their values of the characterization index. The TFS of promazine characterized in the two different ways are shown in Fig. 2.
For the present work we used the latter characterization scheme to generate the
TFS. In the characterization process, suppressed hydrogen atoms are considered as
augmented atoms.
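A minimal sketch of TFS generation under this scheme is given below, assuming the molecule is supplied as a list of bonds (pairs of heavy-atom indices) and a per-atom mass in which the suppressed hydrogens have already been added to their heavy atoms; connected subgraphs are enumerated naively, which is sufficient for illustration.

from collections import Counter
from itertools import combinations

def _connected(edges):
    # check whether an edge subset forms a single connected subgraph
    nodes = {n for e in edges for n in e}
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        u = stack.pop()
        if u in seen:
            continue
        seen.add(u)
        stack.extend(v for a, b in edges if u in (a, b) for v in (a, b) if v != u)
    return seen == nodes

def tfs(bonds, mass, max_edges=3):
    # spectrum: characterization value -> number of subgraphs taking that value
    spectrum = Counter()
    for k in range(1, max_edges + 1):
        for sub in combinations(bonds, k):
            if _connected(sub):
                atoms = {n for e in sub for n in e}
                spectrum[sum(mass[a] for a in atoms)] += 1
    return spectrum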
Fig. 2. TFS of promazine generated by the different characterization methods. (a) is obtained
by the characterization with the sum of degrees of vertex atoms of the fragments. (b) is
obtained by another characterization scheme with the sum of atomic mass numbers of the
fragments
where x_ik is the intensity value of peak k of the TFS for the i-th molecule, and x_jk is that of peak k of the TFS for the j-th molecule that has the highest value of the characterization index (in this work, the highest fragment mass number). For the prediction, each TFS is adjusted by padding with 0 or by cutting off the higher-mass region so that it has the same dimensionality as the training set when a prediction sample is submitted.
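The dimensionality adjustment described here can be sketched by a trivial helper (the function name and signature are ours):

def adjust_tfs(spectrum, train_dim):
    # pad with zeros or cut off the higher-mass region so that the prediction
    # sample has the same dimensionality as the training spectra
    v = list(spectrum)[:train_dim]
    return v + [0] * (train_dim - len(v))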
The support vector machine has attracted attention as a powerful nonlinear classifier in the last decade because it introduces the kernel trick [7]. The SVM implements the following basic idea: it maps the input vectors x into a higher dimensional feature space z through some nonlinear mapping, chosen a priori. In this space, an optimal discriminant surface with maximum margin is constructed (Fig. 3). Given a training dataset X = (x_1, ..., x_i, ..., x_n) whose vectors x_i are linearly separable with class labels y_i ∈ {−1, 1}, i = 1, ..., n, the discriminant function can be described by the following equation:

f(\mathbf{x}_i) = (\mathbf{w} \cdot \mathbf{x}_i) + b \quad (2)
where w is a weight vector and b is a bias. The discriminant surface can be represented as f(x_i) = 0. The maximum margin can be obtained by minimizing the square of the norm of the weight vector w,

\|\mathbf{w}\|^2 = \mathbf{w} \cdot \mathbf{w} = \sum_{l=1}^{d} w_l^2 \quad (3)
This optimization problem reduces to the previous one for separable data when constant C is large enough. This quadratic optimization problem with constraints can be reformulated by introducing Lagrangian multipliers α:

W(\alpha) = \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i \cdot \mathbf{x}_j - \sum_{i=1}^{n} \alpha_i \quad (5)

with the constraints 0 \le \alpha_i \le C and \sum_{i=1}^{n} \alpha_i y_i = 0.
Since the training points x_i appear in the final solution only via dot products, this formulation can be extended to general nonlinear functions by using the concepts of nonlinear mappings and kernels [6]. Given a mapping x → φ(x), the dot product in the feature space can be replaced by a kernel function:

f(\mathbf{x}) = g(\phi(\mathbf{x})) = \sum_{i=1}^{n} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b \quad (6)
Here we used a radial basis function as the kernel for mapping the data into a higher dimensional space:

K(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{\sigma^2}\right) \quad (7)
An illustrative scheme of the nonlinear separation of the data space by SVM is shown in Fig. 3. Basically, the SVM is a binary classifier. For classification problems with three or more categories, plural discriminant functions are required. In this work, the one-against-the-rest approach was used for this purpose. The TFS were used as input feature vectors to the SVM. All the SVM analyses were carried out using a computer program developed by the authors according to Platt’s algorithm [8].
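The authors' own SMO-based program is not available to us; purely as an illustrative stand-in, the same setup can be written with scikit-learn, whose RBF kernel uses gamma = 1/σ² relative to Eq. (7). The soft-margin constant C and the names X_train and y_train below are assumptions.

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

sigma = 40.0                                        # optimum value reported in the results
clf = OneVsRestClassifier(
    SVC(kernel="rbf", gamma=1.0 / sigma**2, C=10.0)  # C is an assumed value
)
# clf.fit(X_train, y_train)        # X_train: TFS feature vectors, y_train: D1..D4 labels
# predicted = clf.predict(X_test)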
Fig. 3. Illustrative scheme of nonlinear separation of two classes by SVM with a kernel trick
propagation method. All the neural network analyses in this work were carried out using
a computer program, NNQSAR, developed by the authors [9].
In this work we employed 1364 dopamine antagonists that interact with four different types of dopamine receptors (D1, D2, D3 and D4). Dopamine is one of the representative neurotransmitters. A decrease in dopamine causes various neuropathies and is closely related to Parkinson’s disease. Several G-protein-coupled dopamine receptors (GPCRs) are known. Dopamine antagonists bind to dopamine receptors and thereby inhibit the role of dopamine. The data were taken from the MDDR [10] database, a structure database of investigational new drugs, and comprise all the dopamine receptor antagonists available in it. The data set was divided into two groups, a training set and a prediction set. The training set consists of 1227 compounds (90% of the total data), and the prediction set consists of the remaining 137 compounds (10% of the total data), chosen randomly.
[Fig. 4 (plot): the recognition rate (%), the prediction rate, and the number of support vectors are plotted against the kernel parameter σ on a logarithmic scale from 1 to 10000.]
Fig. 4. Variation of the recognition rate, the prediction rate, and the number of support vectors
Dopamine antagonists that interact with different types of receptors (D1, D2, D3 and D4) were used to train the SVM with their TFS to classify the type of activity. The SVM model was able to learn to classify all the training compounds into their own activity classes correctly. However, the prediction ability is sensitive to the value of σ in the kernel function used. For this reason, we investigated the optimum value of the σ parameter in preliminary computational experiments. Figure 4 shows plots of the performance for different values of the σ parameter. The plots show that small values of σ gave better training results but poor prediction results. It is clear that there is an optimum value; for the present case, the optimum value was concluded to be 40.
Then, the trained SVM model was used to predict the activity of unknown compounds. For the 137 compounds prepared separately in advance, the activity classes of 123 compounds (89.8%) were correctly predicted. The results are summarized in Table 1. Of the four classes, the prediction result for the D4 receptor antagonists is better than the others in both the training and the prediction. Presumably this is because well-defined support vectors are obtained from the training set, which contains many more samples for this class than for the rest. These results show that the SVM provides a very powerful tool for the classification and prediction of pharmaceutical drug activities, and that the TFS representation is suitable as an input signal to the SVM in this case.
Table 1. Results of the training and the prediction for each activity class

            Training               Prediction
Class       Data      %Correct     Data       %Correct
All         1227      100          123/137    89.8
D1          155       100          15/18      83.3
D2          356       100          31/39      79.5
D3          216       100          22/24      91.7
D4          500       100          55/56      98.2
In the previous section, the TFS-based support vector machine (TFS/SVM) successfully classified and predicted most of the dopamine antagonists into their own activity classes. We also tested the validity of the results using a cross-validation technique. For the same data set of 1364 compounds, containing both the training set and the prediction set, ten different trial datasets were generated. To make these trial datasets, the whole data was randomly divided into 10 subsets of almost the same size (136 or 137 compounds each). Each trial dataset was used in turn as the test set for the prediction, and the remaining subsets were employed for training the TFS/SVM. The results of these computational trials are summarized in Table 2.
In the cross-validation test, the total prediction rates for the ten trials ranged from 87.6% to 93.4%, and the overall average of the prediction rates was 90.6%. These values are quite similar to those described in the previous section. It is considered that the present results strongly validate the utility of the TFS-based support vector machine approach to the classification and prediction of pharmacologically active compounds.
In the preceding work [3], the authors reported that an artificial neural network based on the TFS gives a successful tool for discriminating active drug classes. To evaluate the performance of the SVM approach for the current problem, we compared the results obtained by the SVM with those obtained by an artificial neural network (ANN). The same data set of 1364 drugs used in the above section was employed for this analysis, and the ten-fold cross-validation technique was used for the computational trial. The results are summarized in Table 3.
Table 3. Comparison between SVM and ANN by ten-fold cross validation test

                 SVM                            ANN
Active Class     Training     Prediction       Training     Prediction
                 %Correct     %Correct         %Correct     %Correct
All              100          90.6             87.5         81.1
D1               100          87.5             76.0         70.7
D2               100          86.1             80.7         69.9
D3               100          88.3             90.9         85.8
D4               100          95.5             94.5         90.5
The table shows that the results obtained by the SVM were better than those obtained by the ANN in every case in these trials. These results show that the TFS-based support vector machine gave more successful results than the TFS-based artificial neural network for the current problem.
References
1. Y. Takahashi, H. Ohoka, and Y. Ishiyama, Structural Similarity Analysis Based on Topo-
logical Fragment Spectra, In “Advances in Molecular Similarity”, 2, (Eds. R. Carbo & P.
Mezey), JAI Press, Greenwich, CT, (1998) 93-104
2. Y. Takahashi, M. Konji, S. Fujishima, MolSpace: A Computer Desktop Tool for Visuali-
zation of Massive Molecular Data, J. Mol. Graph. Model., 21 (2003) 333-339
3. Y. Takahashi, S. Fujishima and K. Yokoe: Chemical Data Mining Based on Structural
Similarity, International Workshop on Active Mining, The 2002 IEEE International Con-
ference on Data Mining, Maebashi (2002) 132-135
4. Y. Takahashi, S. Fujishima, H. Kato, Chemical Data Mining Based on Structural Similar-
ity, J. Comput. Chem. Jpn., 2 (2003) 119-126
5. V.N. Vapnik : The Nature of Statistical Learning Theory, Springer, (1995)
6. C. J. Burges: A Tutorial on Support Vector Machines for Pattern Recognition, Data Min-
ing and Knowledge Discovery 2, (1998) 121-167
7. S. W. Lee and A. Verri, Eds, Support Vector Machines 2002, LNCS 2388, (2002)
8. J. C. Platt : Sequential Minimal Optimization: A Fast Algorithm for Training Support
Vector Machines, Microsoft Research Tech. Report MSR-TR-98-14, Microsoft Research,
1998.
9. H. Ando and Y. Takahashi, Artificial Neural Network Tool (NNQSAR) for Structure-
Activity Studies, Proceedings of the 24th Symposium on Chemical Information Sciences
(2000) 117-118
10. MDL Drug Data Report, MDL, ver. 2001.1, (2001)
Cooperative Scenario Mining from Blood
Test Data of Hepatitis B and C
2003], i.e., an event or a situation significant for decision making, a chance occurs at
the cross point of multiple scenarios as in the example above because a decision is to
select one scenario in the future. Generally speaking, a set of scenarios form a basis of
decision making, in domains where the choice of a sequence of events affects the
future significantly. Based on this concept, the methods of chance discovery have
been making successful contributions to science and business domains [Chance
Discovery Consortium 2004].
Now let us stand on the position of a surgeon looking at the time series of
symptoms during the progress of an individual patient’s disease. The surgeon should take appropriate actions for curing this patient at appropriate times. If he does so, the patient’s disease may be cured. Otherwise, the patient’s health condition might worsen radically. The problem here can be described as choosing one
from multiple scenarios. For example, suppose states 4 and 5 in Eq. (1) mean two
opposite situations.
Scenario 1 = {state0 -> state 1 -> state2 -> state3 -> state4 (a normal condition)}.
Scenario 2 = {state 4 -> state5 -> state6 (a fatal condition)}. (1)
Each event-sequence in Eq.(1) is called a scenario if the events in it share some
common context. For example, Scenario 1 is a scenario in the context of cure, and
Scenario 2 is a scenario in the context of disease progress. Here, suppose there is a
hidden state 11, which may come shortly after or before state 2 and state 5. The
surgeon should choose an effective action at the time of state 2, in order to turn this
patient to state 3 and state 4 rather than to state 5, if possible. Such a state as state 2,
essential for making a decision, is a chance in this case.
Detecting an event at a crossover point among multiple scenarios, as state 2 above,
and selecting the scenario going through such a cross point means a chance discovery.
In general, the meaning of a scenario with an explanatory context is easier to
understand than an event shown alone. In the example of the two scenarios above, the
scenario leading to cure is apparently better than the other scenario leading to a fatal
condition. However, the meaning of chance events, which occurs on the bridge from a
normal scenario to a fatal scenario, i.e., state 2, state 11, and state 5 in Fig.1, are hard
to understand if they are shown independently of more familiar events. For example,
if you are a doctor and find polyp is in a patient’s stomach, it would be hard to decide
to cut it away or to do nothing else than leaving it at the current position. On the other
hand, suppose the you find the patient is at the turning point of two scenarios – in one,
the polyp will turn larger and gets worsened. In the other, the polyp will be cut away
and the patient will be cured. Having such understanding of possible scenarios, you
can easily choose the latter choice.
Consequently, an event should be regarded as a valuable chance if the difference of
the merits of scenarios including the event is large, and this difference is an easy
measure of the utility of the chance. Discovering a chance and taking it into
consideration is required for making useful scenarios, and proposing a number of
scenarios even if some are useless is desired in advance for realizing chance
discovery. For realizing these understandings, visualizing the scenario map showing
the relations between states as in Fig. 1 is expected to be useful. Here, let us call each familiar scenario, such as Scenario 1 or Scenario 2, an island, and let us call a link between islands a bridge. In chance discovery, the problem is to have the user obtain bridges between islands, in order to explain the meaning of the connections among islands via bridges, as a scenario expressed in language understandable to the user him/herself.
Fig. 1. A chance existing at the cross point of scenarios. The scenario in the thick arrows
emerged from Scenario 1 and Scenario 2
mind. The new combination of proposed scenarios, made during the arrangement and
the rearrangements of KJ cards, helps the emergence of new valuable scenarios,
to put it in our terminology. In some design processes, on the other hand, it has been
pointed out that ambiguous information can trigger creations [Gaver et al 2003]. The
common point among the scenario “workshop”, the “combination” of ideas in the KJ method, and the “ambiguity” of the information given to a designer is that scenarios presented from the viewpoint of each participant’s environment are bridged via ambiguous pieces of information about the different mental worlds they attend to. From
these bridges, each participant recognizes situations or events which may work as
“chances” to import others’ scenarios to get combined with one’s own. This can be
extended to other domains than designing. In the example of Eq.(1), a surgeon who
almost gave up because he guessed his patient is in Scenario 2, may obtain a new
hope in Scenario 1 proposed by his colleague who noticed that state 2 is common to
both scenarios – only if it is still before or at the time of state 2. Here, state 2 is
uncertain in that its future can potentially go in two directions, and this uncertainty can make it a chance, i.e., an opportunity and not only a risk.
In this paper we apply a method for aiding scenario emergence, by means of
interaction with real data using two tools of chance discovery, KeyGraph in [Ohsawa
2003b] and TextDrop [Okazaki 2003]. Here, KeyGraph is applied with additional causal directions in the co-occurrence relations between values of variables in the blood-test data of hepatitis patients (let us call the result a scenario map), and TextDrop helps in extracting the part of the data corresponding to the concern of an expert, here a surgeon and a physician.
These tools aid in obtaining useful scenarios of hepatitis progress and cure, reasonably restricted to an understandable type of patient, from the complex real data taken from a mixture of various scenarios. The scenarios obtained for hepatitis were evaluated by two hepatologists, a hepatic surgeon and a hepatic physician, to be useful in finding a good timing for actions in treating hepatitis patients. This evaluation is subjective in the sense that only a small part of the large number of patients in the data were observed to follow the entire scenarios obtained. That is, a scenario corresponding to fewer patients may seem less trustworthy. However, we can say our discovery process worked quite well under the hard condition that full scenarios of really critical worsening or of exceptionally successful treatment occur only very rarely. We can say a new and useful scenario can be created by merging the subjective interests of experts, reflecting their real-world experiences, with the objective tendencies in the data. Rather than discussing data-based discovery, our point is the humans’ role in discoveries.
participants of a co-working group for chance discovery, sharing the same visual
result. Then, words corresponding to bridges among the various daily contexts of
participants are visualized in the next step of visual data mining applied to the subject
data, i.e., the text data recording the thoughts and opinions in the discussion. Via the
participants’ understanding of these bridges, the islands get connected and form novel
scenarios.
By this time, the participants may have discovered chances on the bridges, because
each visualized island corresponds to a certain scenario familiar to some of the
participants and a bridge means a cross-point of those familiar scenarios. Based on
these chances, the user(s) make actions, or simulate actions in a virtual environment,
and obtain concerns with new chances – the helical process returns to the initial step
of the next cycle. DH is embodied in this paper, in the application to obtaining
hepatitis scenarios. Users watch and discuss KeyGraph [Ohsawa 2003b] applied to
the subject data and the object data in the process, thinking and talking about
scenarios the diagram may imply.
KeyGraph is a computer-aided tool for visualizing the map of event relations in the
environment, in order to aid in the process of chance discovery. If the environment
represents a place of discussion, an event may represent a word spoken by a participant. By
visualizing the map where the words appear connected in a graph, one can see the
overview of participants’ interest. Suppose text-like data (string-sequence) D is
given, describing an event-sequence sorted by time, with periods (``.'') inserted at the
parts corresponding to the moments of major changes. For example, let text D be:
D = a1, a2, a4, a5, … .
a4, a5, a3, … .
a1, a2, a6, … .
… a4, a5 .
a1, a2, …, a5, …, a10.
a1, a2, a4, …, …, a10.
….    (2)
market is losing customers, and Island (2) shows the context of a target company’s
current state. The bridge “restructuring” shows the company may introduce
restructuring, e.g. firing employees, for surviving in the bad state of the market.
Fig. 4. KeyGraph for D in Eq. (2). Each island includes an event-set such as {customers, decrease, market}, {steel, concrete}, {company}, etc. The double-circled node and the red (“restructuring”) node show a frequent word and a rare word, respectively, which form hubs of bridges
“Restructuring” might be rare in the communication of the company staffs, but this
expresses the concern of the employees about restructuring in the near future.
D=“
Speaker A : In the market of general construction, the customers decreased.
Speaker B: Yes… My company, building from concrete and steel, is in this bad
trend.
Speaker C: This state of the market induces a further decrease of customers. We
may have to introduce restructuring, for satisfying customers.
Speaker B: I agree. Then the company can reduce the production cost of
concrete and steel. And, the price for the customers of construction...
Speaker D: But, that may reduce the power of the company. ” (3)
In the case of a document as in Eq. (3), periods are put at the end of each sentence, and the result shows an overview of the content. In the case of point-of-sale (POS) data, periods are put at the end of each basket. KeyGraph is then applied to D (see [Ohsawa 2003b] for the detailed steps).
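The following is only a much-simplified sketch of the flavor of such a co-occurrence map, not the published KeyGraph algorithm (see [Ohsawa 2003b] for the real procedure): frequent items linked by frequent co-occurrence form island candidates, and the strongest links are returned.

from collections import Counter
from itertools import combinations

def cooccurrence_map(baskets, top_items=10, top_links=10):
    # baskets: list of sets of items (one basket / sentence per period in D)
    freq = Counter(i for b in baskets for i in set(b))
    items = {i for i, _ in freq.most_common(top_items)}
    pairs = Counter()
    for b in baskets:
        pairs.update(frozenset(p) for p in combinations(sorted(set(b) & items), 2))
    links = [tuple(p) for p, _ in pairs.most_common(top_links)]
    return items, links   # islands emerge as connected groups of these links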
In the case of marketing, users of KeyGraph sold out new products, and made real
business profits [Usui and Ohsawa 2003]. The market researchers visualized the map
of their market using KeyGraph, where nodes correspond to products and links
correspond to co-occurrences between products in the basket data of customers. In
this map, market researchers found a valuable new scenario of the life-style of
customers who may buy the product across a wide range in the market, whereas
previous methods of data-based marketing helped in identifying focused segments of
customers and the scenarios to be obtained have been restricted to ones of customers
in each local segment. As a result, users successfully found promising new products,
in a new desirable and realistic (i.e. possible to realize) scenario emerging from
marketing discussions where scenarios of customers in various segments were
exchanged. This realized hits of new products that appeared in KeyGraph at bridges between islands, i.e., between groups of established customers.
However, a problem was that it is not efficient to follow the process of DH using
KeyGraph solely, because the user cannot extract an interesting part of the data easily,
when s/he has got a new concern with chances. For example, they may become
concerned with a customer who buys product A or product B and also buys product
C, but does not buy product D. Users have been waiting for a tool to look into such customers matching their concern.
4.3 The Process Using TextDrop for Retrieving Data Relevant to User’s Concern
Step 5) Execute or simulate (draw concrete images of the future) the scenarios
obtained in Step 4), and, based on this experience, refine the statement of the new
concern in concrete words. Go to Step 1).
The following shows the style of data obtained from blood tests of hepatitis cases. Each event represents a pair of a variable and its observed value. That is, an event written as “a_b” means a piece of information that the value of variable a was b. For example, T-CHO_high (T-CHO_low) means T-CHO (total cholesterol) was higher (lower) than a predetermined upper (lower) bound of the range expected if the liver is in a normal state. Note that the lower (higher) bound of each variable was set higher (lower) than the values defined in hospitals, in order to be sensitive to moments when the variables take unusual values. Each sentence in the data, i.e., a part delimited by ‘.’, represents the sequence of blood-test results for the case of one patient. See Eq. (3).
Case1 = {event1, event2, ….., event m1}.
Case2 = {event 2, event 3, …., event m2}.
Case3 = {event 1, event 5, …, event m3}. (3)
As in Eq.(3), we can regard one patient as a unit of co-occurrence of events.
That is, there are various cases of patients and the sequence of one patient’s events
means his/her scenario of wandering in the map of symptoms.
For example, suppose we have the data as follows, where each event means a value
of a certain attribute of blood, e.g. GPT_high means the state of a patient whose value
of GPT exceeded its upper bound of normal range. Values of the upper and the lower
bounds for each attribute are pre-set according to the standard values shared socially.
Each period (‘.’) represents the end of one patient’s case. If the doctor becomes
interested in patients having experiences of both GPT_high and TP_low, then s/he
enters “GPT_high & TP_low” into TextDrop in Step 1), corresponding to the
underlined items below. As a result, the italic sentences are extracted and given to
KeyGraph in Step 2).
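A hypothetical sketch of this extraction step is shown below (the real TextDrop query language is richer): each patient case is treated as a set of events, and a simple conjunction of '|'-separated alternatives selects the cases handed to KeyGraph.

def matches(case_events, query):
    # query example: "GPT_high & TP_low"; '&' joins required terms,
    # '|' lists alternatives within a term
    return all(any(alt.strip() in case_events for alt in term.split("|"))
               for term in query.split("&"))

cases = [{"GPT_high", "TP_low", "T-CHO_low"}, {"GPT_high", "ZTT_high"}]
selected = [c for c in cases if matches(c, "GPT_high & TP_low")]
print(selected)   # only the first case is extracted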
The scenario map in Fig. 5, for all the data of hepatitis B extracted by TextDrop with the query “type-B,” was shown to a hepatic surgeon and a hepatic physician at Step 2) above, in the first cycle. This scenario map was co-manipulated by a technical expert of KeyGraph and these hepatologists at Step 3), while talking about scenarios of hepatitis cases. Each cycle was executed similarly, presenting a scenario map for the
data extracted according to the newest concern of the hepatologists. For example, in
Fig.5, the dotted links show the bridges between events co-occurring more often than
a given threshold of frequency. If there is a group of three or more events co-
occurring often, they become visualized as an island in KeyGraph.
At Step 3-1), participants grouped the nodes in the circles as in Fig.5, after getting
rid of unessential nodes from the figure and resolving redundancies by unifying such
items as “jaundice” and “T-BIL_high” (high total bilirubin), which mean the same thing, into one of those items. We wrote down the comments of the hepatologists at Step 3-2), about scenarios of hepatitis progress/cure, while they looked at the KeyGraph. Each ‘?’ in Fig. 5 marks a part that the hepatologists could not understand clearly enough to express in a scenario, but which they said was interesting. In the figures hereafter, the dotted frames and their interpretations were drawn manually, reflecting the hepatologists’ comments. Yet, all that was obtained by this step were short parts of common-sense scenarios about
hepatitis B, as follows:
(Scenario B1) Transition from/to types of hepatitis exist, e.g. CPH (chronic persistent
hepatitis) to CAH (chronic aggressive hepatitis), CAH2A (non-severe CAH) to
CAH2B (severe CAH), and so on.
(Scenario B2) Biliary blocks due to ALP_high, LAP_high, and G-GTP_high can lead to jaundice, i.e., an increase in T-BIL. Considering that D-BIL increases more keenly than I-BIL, this may be from the activation of the lien (spleen), due to critical illness of the liver.
(Scenario B3) The increase in immunoglobulin (G_GL) leads to the increase in ZTT.
Then, at Step 4), we applied KeyGraph to the memo of comments obtained in Step 3-
2). The comments lasted for two hours, so we cannot include all of the content here. A short part is given below, for example.
[Comments of a surgeon looking at Fig.5]
Hepatitis B is from virus, you know. This figure is a mixture of scenarios in
various cases. For example, AFP/E is a marker of tumor. CHE increases faster
than T-BIL and then decreases, as far as I have been finding. If cancer is found
and grows, the value of CHE tends to decrease. However, it is rare that the
cancer produces CHE.
…
Jaundice i.e., the increase in T-BIL, is a step after the hepatitis is critically
progressed. Amylase from sialo and pancreas increase after operation of tumors
and cancer, or in the complete fibrosis of liver.
…
PL is low in case the liver is bad. In complete fibrosis, CRE decreases.
…
LDH increases and decreases quickly for fulminant hepatitis B.
…
From the KeyGraph applied to the comments above, we obtained Fig.6. According
to Fig.6, their comments can be summarized as :
For treatment, diagnosis based on bad scenarios in which hepatitis grows is essential.
However, this figure shows a complicated mixture of scenarios of various contexts,
i.e. acute, fulminant, and other types of hepatitis. We can learn from this figure that a
biliary trouble triggers the worsening of the liver, which is observed as jaundice
represented by high T-BIL (total bilirubin).
Fig. 5. The scenario map, at an intermediate step of manipulations for hepatitis B. In this case,
all links are dotted lines because islands with multiple items were not obtained
Fig. 8. The scenario map for severe chronic aggressive hepatitis, i.e. CAH2B
Fig. 9. The scenario map for hepatitis B, in the spotlight of F1, F2, F3, and F4 (LC)
In particular, the surgeon’s feeling in event 2) was a piece of his tacit experience recalled by the KeyGraph obtained here, but it has not been published anywhere yet. These implications of Fig. 8 drove us to separate the data into cases including progressive scenarios and others, so we extracted the stepwise progress of fibrosis denoted by F1, F2, F3, and F4 (or
LC: liver cirrhosis). Fig. 9 is the scenario map for hepatitis B, corresponding to the
spotlights of F1, F2, F3, and F4 (LC), i.e., for data extracted by TextDrop with entry
“type-B & (F1 | F2 | F3 | F4 | LC)”.
Fig. 9 matched the hepatologists’ experience in its overall flow, as itemized below. These results are useful for understanding the state of a patient at a given time of observation in the transition process of symptoms, and for finding a suitable time and action in the treatment of hepatitis B.
- A chronic active hepatitis sometimes changes into a severe progressive hepatitis
and then to cirrhosis or cancer, in the case of hepatitis B.
- The final states of critical cirrhosis co-occur with kidney troubles and with malignant tumors (cancer) accompanied by deficiencies of white blood cells.
- Recovery is possible from the earlier steps of fibrosis.
- The low LDH after the appearance of high LDH can be an early sign of fulminant
hepatitis (see item 2) above).
- The low Fe level with cirrhosis can be a turning point to a better state of the liver. In [Rubin et al 1995], it was suggested that iron reduction improves the response of chronic hepatitis B or C to interferon. Fig. 9 does not show anything about the effect of interferon, but the appearance of FE_low, on the only bridge from cirrhosis to recovery, is very useful for finding the optimal timing to treat a patient with hepatitis B, and is relevant to Rubin’s result in this sense.
For the cases of hepatitis C, as in Fig.10, we also found a mixture of scenarios, e.g.,
(Scenario C1) Transition from CPH to CAH
(Scenario C2) Transition to critical stages like cancer, jaundice, etc
These common-sense scenarios are similar to the scenarios in the cases of hepatitis
B, but we also find “interferon” and a region of cure at the top (in the dotted ellipse)
in Fig.10. GOT and GPT (i.e. AST and ALT, respectively) can be low both after the
fatal progress of heavy hepatitis and when the disease is cured. The latter case is rare
because GOT and GPT are expected to take “normal” value, i.e., between the lower
and the upper threshold, rather than being “low” i.e. lower than the lower threshold of
normal state.
However, because the lower bounds of variables such as GOT and GPT were set to high values, we can find moments when these variables take “low” values as the case changes to a normal state. As well, given the low value of ZTT in this region, we can judge that the case is in the process of being cured.
This suggested a new concern that the data with “interferon & ZTT_low” (i.e.
interferon has been used, and the value of ZTT recovered) may clarify the role of
interferon, i.e., how the process of curing hepatitis goes when interferon is given to the patient. Fig. 11 was obtained for the data corresponding to this concern, extracted using TextDrop. In Fig. 11, we find the following features:
Fig. 10. The scenario map for all the cases of hepatitis C
- The values of GPT and GOT are lowered by the treatment using interferon, and then ZTT decreases.
- Both scenarios of cure and of worsening progress are found following the decrease in ZTT. In the worse scenario, typical bad symptoms such as jaundice and low values of CHE and ALB (albumin) appear as a set. In the better scenario, the recovery of various factors such as blood platelets (PL_high) is obvious.
- In the worse (former) scenario, blood components such as PL are diminished.
Fig. 11. Scenario map for cases with interferon and low values of ZTT
In the better (latter) scenario, changes in the quantity of proteins are remarkable.
Among these, F-A2_GL (a2 globulin, mostly composed of haptoglobin, which
prevents the critical decrease in iron by being coupled with hemoglobin before
hemoglobin gets destroyed to make isolated iron) and F-B_GL (beta globulin,
composed mostly of transferrin which carries iron for reusing it to make new
hemoglobin) are relevant to the metabolism of iron (FE).
The realities and the significance of these scenarios were supported by the
physician and the surgeon. To summarize those scenarios briefly, the effect of
interferon works for cure only if the recycling mechanism of iron is active. In fact, the
relevance of hepatitis and iron has been a rising concern of hepatologists working on
hepatitis C, since Hayashi et al (1994) verified the effect of iron reduction in curing
hepatitis C.
Fig. 12. Scenario map for cases with F-A2_GL_high & F-A2_GL_low, and interferon. The
effect of interferon is clarified at the top of the map
We obtained some qualitative new findings. For example, the low value as well as the high value of LDH is relevant to the turning point of fulminant hepatitis B in shifting to critical stages. This means a doctor must not be relieved on finding that the value of LDH has decreased, merely because the opposite symptom, i.e., an increase in LDH, is a sign of hepatitis progress.
Also, the effect of interferon is relevant to changes in the quantity of proteins, especially ones relevant to iron metabolism (e.g., F-A2_GL and F-B_GL). The effect of interferon, as a result, appears to begin with the recovery of TP. These tendencies are apparent from both Fig. 11 and Fig. 12, but it is still an open problem whether interferon is affected by such proteins as globulins, or whether iron metabolism and interferon are two cooperating factors of cure. All in all, “reasonable, and sometimes novel and useful” scenarios of hepatitis were obtained, according to the surgeon and the physician.
Although not covered here, we also found other scenarios to approach significant
situations. For example, the increase in AMY (amylase) in Fig.5, Fig.8, Fig.9, and
Fig.10, can be relevant to surgical operations or liver cancers. The relevance of
amylase corresponds to Miyagawa et al (1996), and its relevance to cancer has been
pointed out in a number of references such as Chougle (1992) and also in commercial
web sources such as Ruben (2003).
These are the effects of the double helix (DH) process, the concern-focusing process of the user in interaction with the target data. Apparently, the subjective bias of the hepatologists’ concern influenced the results obtained in this manner, and this subjectivity sometimes becomes the reason why scientists discount the value of results obtained by DH. However, the results are trustworthy because they simply summarize objective facts selected through the subjective focus of the hepatologists’ concerns.
As a positive face of this combination of subjectivity and objectivity, we discovered a broad range of knowledge useful in real decisions, expressed in the form of scenarios. We say this because subjectivity has the effect of choosing an interesting part of the wide objective environment, and this subjective feeling of interestingness comes from the desire for useful knowledge.
7 Conclusions
Scenario emergence, a new side of chance discovery, is useful for decisions in the real world, where events occur dynamically and one is required to make a decision promptly at the time a chance, i.e., a significant event, occurs. In this paper we showed an application of scenario emergence, discovering triggering events of essential scenarios in the domain of hepatitis progress and treatment. Realistic and novel scenarios were obtained, according to experts of the target domain.
In the method presented here, the widening of the user’s view was aided by the projection of the users’ personal experiences onto the objective scenario map representing a mixture of contexts in the wide real world. The narrowing, i.e., data-choosing, was
Acknowledgment
This study was conducted as part of the Scientific Research on Priority Areas project
"Active Mining." We thank Chiba University Hospital for providing us with the
invaluable data under a contract permitting its use for research.
References
Chance Discovery Consortium, 2004, http://www.chancediscovery.com
Chougle A., Hussain S., Singh P.P., and Shrimali R., 1992, Estimation of serum amylase levels
in patients of cancer head and neck and cervix treated by radiotherapy, Journal of Clinical
Radiotherapy and Oncology, 7(2): 24-26
Gaver W.W., Beaver J., and Benford S., 2003, Ambiguity as a Resource for Design, in
Proceedings of Computer Human Interactions
Hayashi H., Takikawa T., Nishimura N., Yano M., Isomura T., and Sakamoto N., 1994,
Improvement of serum aminotransferase levels after phlebotomy in patients with chronic
active hepatitis C and excess hepatic iron, American Journal of Gastroenterology, 89: 986-988
Miyagawa S., Makuuchi M., Kawasaki S., Kakazu T., Hayashi K., and Kasai H., 1996, Serum
amylase elevation following hepatic resection in patients with chronic liver disease, Am. J.
Surg., 171(2): 235-238
Miyagawa S., Makuuchi M., Kawasaki S., and Kakazu T., 1994, Changes in serum amylase
level following hepatic resection in chronic liver disease, Arch. Surg., 129(6): 634-638
Ohsawa Y. and McBurney P. (eds.), 2003, Chance Discovery, Springer Verlag
Ohsawa Y., 2003a, Modeling the Process of Chance Discovery, in Ohsawa Y. and McBurney P.
(eds.), Chance Discovery, Springer Verlag, pp. 2-15
Ohsawa Y., 2003b, KeyGraph: Visualized Structure Among Event Clusters, in Ohsawa Y. and
McBurney P. (eds.), Chance Discovery, Springer Verlag, pp. 262-275
Okazaki N. and Ohsawa Y., 2003, Polaris: An Integrated Data Miner for Chance Discovery, in
Proceedings of the Third International Workshop on Chance Discovery and Its Management,
Crete, Greece
Rubin R.B., Barton A.L., Banner B.F., and Bonkovsky H.L., 1995, Iron and chronic viral
hepatitis: emerging evidence for an important interaction, Digestive Diseases
Ruben D., 2003, Understanding Blood Work: The Biochemical Profile,
http://petplace.netscape.com/articles/artShow.asp?artID=3436
Integrated Mining for Cancer Incidence Factors from Healthcare Data

X. Zhang and T. Narita

Abstract. This paper describes how data mining is being used to identify the
primary factors of cancer incidence and the living habits of cancer patients
from a set of health and living habit questionnaires. Decision tree, radial basis
function, and back propagation neural network methods have been employed
in this case study. Decision tree classification uncovers the primary factors of
cancer patients from rules. The radial basis function method has advantages in
comparing the living habits of a group of cancer patients with those of a group
of healthy people. The back propagation neural network contributes to eliciting
the important factors of cancer incidence. This case study provides a useful
data mining template for characteristics identification in healthcare and other
areas.
1 Introduction
With the development of data mining approaches and techniques, applications of
data mining can be found in many organizations, such as banking, insurance,
industry, and government. Large volumes of data, whether from scientific research or
from business, are produced in every social organization. For example, human
genome data are being created and collected at a tremendous rate, so maximizing the
value of such complex data is essential. Since the ever-increasing volume of data is
becoming more and more difficult to analyze with traditional data analysis methods,
data mining has earned an impressive reputation in data analysis and knowledge
acquisition. Recently, data mining methods have been applied to many areas,
including banking, finance, insurance, retail, the healthcare and pharmaceutical
industries, as well as gene analysis [1, 2].
Data mining methods [3, 4, 5] are used to extract valid, previously unknown, and
ultimately comprehensible information from large data sources. The extracted
information can be used to form a prediction or classification model. Applying
several mining methods to the given mining data is recommended, since one mining
algorithm may outperform another and more useful rules and patterns can then be
revealed.
The Rbf function can easily be extended to dimensions higher than two (see [14]).
As a regularization network, Rbf is equivalent to generalized splines. The
architecture of the backpropagation neural network consists of a multilayer network
in which one hidden layer and a set of adjustable parameters are configured. The
Boolean version of Rbf divides the input space into hyperspheres, each corresponding
to a center. This center, which we also call a radial unit, is active if the input vector is
within a certain radius of the center and is otherwise inactive. With an arbitrary
number of units, each network can approximate the other, since each can approximate
continuous functions on a limited interval.
This property, which we call "divide and conquer", can be used to identify the
primary factors of the related cancer patients within the questionnaires. With this
"divide and conquer" property, Rbf is able to deal with training data whose
distribution is highly uneven.
Rbf is also effective in solving problems in which the training data includes noise,
because the transfer function can be viewed as a linear combination of nonlinear
basis functions that effectively adjust the weights of the neurons in the hidden layer.
In addition, Rbf allows its model to be generated in an efficient way.
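As a concrete illustration of the Rbf idea described above — local Gaussian radial units combined linearly to predict a target in [0, 1] such as probV26 — the following minimal Python sketch may help. The synthetic data, the random choice of centers, and the ridge regularization are illustrative assumptions, not the actual tool or configuration used in the study.

```python
# Minimal radial-basis-function (RBF) predictor: local Gaussian units
# ("radial units") cover regions of the data space and a linear read-out
# combines them.  All data and hyperparameters below are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def fit_rbf(X, y, n_centers=20, reg=1e-3):
    # Pick centers as a random subset of the training rows
    # (a simple stand-in for a clustering step).
    centers = X[rng.choice(len(X), size=n_centers, replace=False)]
    # Width chosen from the median pairwise distance between centers.
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    sigma = np.median(d[d > 0])

    def design(Xq):
        dist2 = ((Xq[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-dist2 / (2 * sigma ** 2))

    H = design(X)
    # Ridge-regularized least squares for the output weights.
    w = np.linalg.solve(H.T @ H + reg * np.eye(n_centers), H.T @ y)
    return lambda Xq: design(Xq) @ w

# Synthetic stand-in for a questionnaire data mart: 8 numeric habit fields
# and a probability-like target in [0, 1].
X = rng.normal(size=(500, 8))
y = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))   # unknown "true" risk

predict = fit_rbf(X, y)
print("predicted probabilities for first 3 rows:", predict(X[:3]).round(3))
```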
calculate the correction error δ2j = oj(1 − oj)(yj − oj), and adjust the weight w2ij
with the function w2ij(t + 1) = w2ij(t) + δ2j · hj · η, where w2ij(t) are the respective
weights at time t and η is a constant (η ∈ (0, 1)); (6) calculate the correction error for
the hidden layer by means of the formula δ1j = hj(1 − hj) Σi δ2i · w2ij, and adjust the
weights w1ij(t) by w1ij(t + 1) = w1ij(t) + δ1j · xj · η; (7) return to step 3, and repeat
the process.
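For readability, the update rules quoted above can be written out as a small numpy sketch. The layer sizes, the learning rate η = 0.5, and the toy sample are assumptions for illustration; the variable names (w1, w2, delta1, delta2) mirror the notation of the text.

```python
# Sketch of the back-propagation updates quoted above (sigmoid units,
# learning rate eta in (0, 1)); sizes and the toy sample are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, n_out, eta = 4, 3, 1, 0.5
w1 = rng.normal(scale=0.1, size=(n_in, n_hid))   # input  -> hidden weights w1_ij
w2 = rng.normal(scale=0.1, size=(n_hid, n_out))  # hidden -> output weights w2_ij

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_step(x, y):
    global w1, w2
    h = sigmoid(x @ w1)                      # hidden activations h_j
    o = sigmoid(h @ w2)                      # output activations o_j
    delta2 = o * (1 - o) * (y - o)           # delta2_j = o_j(1-o_j)(y_j-o_j)
    delta1 = h * (1 - h) * (delta2 @ w2.T)   # delta1_j = h_j(1-h_j) sum_i delta2_i w2_ij
    w2 += eta * np.outer(h, delta2)          # w2 update, cf. w2ij(t+1) = w2ij(t) + delta2·h·eta
    w1 += eta * np.outer(x, delta1)          # w1 update, cf. w1ij(t+1) = w1ij(t) + delta1·x·eta
    return float(((y - o) ** 2).sum())

x, y = rng.normal(size=n_in), np.array([1.0])
for _ in range(200):                         # step (7): repeat the process
    err = train_step(x, y)
print("squared error after training:", round(err, 6))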
The weight adjustment is also an error adjustment and propagation process, in which
the errors (in the form of weights) are fed back to the hidden units. These errors are
normalized per input field so that they sum to 100%. This process is often regarded as
a sensitivity analysis, showing each input field's (variable's) contribution to the
classification of a class label. Therefore, omitting the variables that do not contribute
improves the training time, and those variables are not included in the classification
run. As we know, real data sets often include many irrelevant or redundant input
fields. By examining the weight matrix of the trained neural network itself, the
significance of the inputs can be determined. A comparison is made by sensitivity
analysis, where the sensitivity of the outputs to input perturbation is used as a
measure of the significance of the inputs. In practice, in the decision tree
classification, by making use of sensitivity analysis and removing the variables with
the lowest contributions, understandable and clear decision trees are easily generated.
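A minimal sketch of such perturbation-based sensitivity analysis is given below. The generic predict function and the 5% contribution cut-off are assumptions for illustration, not parameters reported in the study.

```python
# Perturbation-style sensitivity analysis: perturb one input field at a time,
# measure the change in the model's output, and normalize the scores so they
# sum to 100%.  `predict` can be any trained model (e.g. the earlier sketches).
import numpy as np

def sensitivity(predict, X, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    base = predict(X)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] += noise * X[:, j].std() * rng.normal(size=len(X))
        scores[j] = np.abs(predict(Xp) - base).mean()
    return 100 * scores / scores.sum()        # percentage contributions

# Usage sketch: keep only fields contributing at least 5% before tree building
# (the threshold is an assumption).
# contrib = sensitivity(predict, X)
# keep = [j for j, c in enumerate(contrib) if c >= 5.0]
```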
When building a data mart for Rbf analysis, the analysis objective is to predict the
probability of cancer incidence. The variable "cancer flag" is mapped to a new
variable "probV26" with values in [0, 1]. This new variable is used as the dependent
variable, whose values (i.e., probabilities) will be predicted from the other variables
(the independent variables).
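The following short sketch, assuming hypothetical column names and pandas as the tooling, illustrates this mapping of the "cancer flag" to the dependent variable "probV26" in a data mart.

```python
# Sketch of the data-mart preparation described above: the categorical
# "cancer flag" is mapped to a numeric target "probV26" in [0, 1], which the
# RBF models then predict from the remaining (independent) questionnaire
# fields.  Column names and the toy frame are assumptions for illustration.
import pandas as pd

mart = pd.DataFrame({
    "cancer_flag": ["yes", "no", "no", "yes"],
    "breakfast_regular": [1, 1, 0, 1],
    "chicken_per_week": [1, 3, 4, 0],
})
mart["probV26"] = (mart["cancer_flag"] == "yes").astype(float)   # 1.0 or 0.0
y = mart["probV26"]
X = mart.drop(columns=["cancer_flag", "probV26"])                # independent variables
print(X.shape, y.tolist())
```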
With Rbf, several predictive models have been generated. First, one predictive model
was generated with the dependent variable "probV26" (mentioned above) and all
independent variables, and further predictive models were built for male and female
cancer patients. Second, in order to further uncover the different living habits of the
cancer patient group and the non-cancer people, the independent variables of the
predictive models were selected from only personal and illness information, from
only the health status of families, and from only drinking and smoking information,
respectively. With these data marts, the related Rbf predictive models were built.
Fig. 1 shows the Rbf predictive models that predict the probability of cancer
incidence, where Rbf automatically builds small, local predictive models for different
regions of the data space. This "divide and conquer" strategy works well for
prediction and factor identification. The chart indicates eating habits in terms of
food-eating frequency. The segment with a low probability of cancer incidence (at
the bottom) shows that the chicken-eating frequency (the second variable from the
left) is higher than in the segment with a higher probability of cancer incidence (at
the top). A detailed explanation of the distributions is given in Fig. 2, where, for a
numerical variable, the distributions in the population and in the current segment are
indicated. Moreover, the percentage of each distribution can be given if necessary.
With the predictive ability of Rbf, the characteristics of every segment can be
identified, and the living habits of cancer patients and of non-cancer or healthy
people are thereby discovered.
Fig. 2 shows an example of the distributions of a variable. The histogram chart is for
a numerical variable in a segment of the Rbf models, where the distributions of the
variable in the population and in the current segment are shown, respectively.
Table 1. Comparison between female cancer patients and non-cancer women in illnesses suffered
For instance, the percentage of partition 4 in the current segment is higher than that
in the population (the entire data set). The pie chart is for a categorical variable: the
outside ring shows the distribution of the variable over the population, while the
inside ring shows its distribution in the current segment. In the pie chart of Fig. 2, the
proportion of "No" in the current segment is smaller than that in the population.
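A small sketch of this segment-versus-population comparison is shown below; the field names, the input file, and the 0.15 probability threshold for selecting a high-risk segment are assumptions for illustration.

```python
# Sketch of the Fig. 2 / Table 1 style comparison: the distribution of a field
# inside one RBF segment versus the whole population.  Segment membership and
# field names are assumed here; in the study they come from the RBF model.
import pandas as pd

def compare(df, segment_mask, field):
    pop = df[field].value_counts(normalize=True) * 100
    seg = df.loc[segment_mask, field].value_counts(normalize=True) * 100
    return pd.DataFrame({"population %": pop, "segment %": seg}).fillna(0).round(1)

# Usage sketch (hypothetical file and columns):
# df = pd.read_csv("questionnaire.csv")
# high_risk = df["predicted_prob"] >= 0.15          # e.g. a high-incidence segment
# print(compare(df, high_risk, "kidney_illness"))
```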
The results of Rbf mining are interesting. They provide comparative information on
illnesses suffered by cancer patients and by non-cancer people. Some results from
Rbf are described in Table 1. In this table, for a prediction case of cancer incidence in
women, 6.5% of the high cancer incidence group has kidney illness, while the
percentage for non-cancer people is 3.5%. For womb illness and blood transfusion,
the figures for female cancer patients are higher than those for non-cancer women.
The eating habits of the group of female cancer patients and the group of non-cancer
women were compared. The results of Rbf (shown in Fig. 3) give the characteristics
of the segment with the highest cancer incidence probability (probability 0.15,
segment ID 98) and of the segment with the lowest probability (probability 0,
segment ID 57). By picking out the variables with clearly different distributions in
these two segments, comparable results are obtained. Within eating habits, comparing
the regular-breakfast category, 87% of female cancer patients have a regular breakfast
habit, which is lower than the 93.3% of non-cancer women. For meat-eating and
chicken-eating 3-4 times per week, the figures are 12% and 8% for female cancer
patients, and 27.7% and 18% for non-cancer women, respectively. Regarding personal
life, 54.5% of female cancer patients live in a state of stress, far higher than the 15%
of non-cancer women. As for judgment on matters, 36% of female cancer patients
make their judgments very quickly; this percentage is also higher than the 13% figure
for non-cancer women.
Fig. 3. Comparison of eating habits: female cancer incidences and non-cancer women
has suffered cancer is also important. The factors of preference for hot food and for
rich, oily food are also noted.
As described above, the decision tree classification, Rbf, and Bpn mining processes
generate useful rules, patterns, and cancer incidence factors. By carefully preparing
the mining data marts and removing irrelevant or redundant input variables, decision
tree rules show the primary factors of cancer patients. Rbf predictive models reveal
more detailed information for comparing cancer patients with non-cancer people.
Bpn clearly reports the important cancer incidence factors. These results enable us to
discover the primary factors and living habits of cancer patients and their
distinguishing characteristics compared with non-cancer people, further indicating
which factors contribute most to cancer incidence.
5 Concluding Remarks
In applying data mining to the medical and healthcare area, the special issue in [6]
describes in more detail the mining results obtained with a variety of mining methods
applied to common medical databases. This is a very interesting way to evaluate
mined results. However, for a specific mining purpose, the researchers have not
described which algorithm is effective and which result (or part of a result) is more
useful or accurate than those generated with other algorithms.
In applying decision trees to medical data, our earlier work [19] used clustering and
classification, but a comparison between the cancer patient group and non-cancer
people could not be given, and the critical cancer incidence factors were not acquired.
The present study with Rbf has successfully compared the cancer patient group with
non-cancer people, and Bpn has been used to find significant cancer incidence
factors. For the application of Rbf to factor identification, a case study of
semiconductor yield forecasting can be found in [20]. The works of [9, 10] are
applications of Bpn to genome data. However, these applications are each based on a
single mining method; the mining results can therefore be improved with
multistrategy data mining. The integration of decision tree, Rbf, and Bpn mining is an
effective way to discover rules, patterns, and important factors.
Data mining is a very useful tool in the healthcare and medical area. Ideally, large
amounts of data (e.g., human genome data) are continuously collected. These data
are then segmented, classified, and finally reduced to a form suited to generating a
predictive model. With an interpretable predictive model, the significant factors of a
prediction target can be uncovered. This paper has described an integrated mining
method that includes multiple algorithms and mining operations for efficiently
obtaining results.
Acknowledgements
This work was supported in part by the Project (No.2004D006) from Hubei
Provincial Department of Education, P. R. China.
References
1. S. L. Pomeroy et al. Prediction of central nervous system embryonal tumour out-
come based on gene expression. Nature, 405:436–442, 2002.
2. Y. Kawamura and X. Zhang and A. Konagaya. Inference of genetic network in
cluster level. 18th AI Symposium of Japanese Society for Artificial Intelligence,
SIG-J-A301-12P, 2003.
3. R. Agrawal, T. Imielinski, and A. Swami. Database mining: A performance per-
spective. IEEE Transactions on Knowledge and Data Engineering, 5:914–925, 1993.
4. M.S. Chen, J. Han, and P.S. Yu. Data mining: An overview from a database
perspective. IEEE Transactions on Knowledge and Data Engineering, 8:866–883,
1996.
5. Xiaolong Zhang. Knowledge Acquisition and Revision with First Order Logic In-
duction. PhD Thesis, Tokyo Institute of Technology, 1998.
6. Special Issue. Comparison and evaluation of KDD methods with common medical
databases. Journal of Japanese Society for Artificial Intelligence, 15:750–790, 2000.
7. C. Apte, E. Grossman, E. Pednault, B. Rosen, F. Tipu, and B. White. Probabilistic
estimation based data mining for discovering insurance risks. Technical Report
IBM Research Report RC-21483, T. J. Watson Research Center, IBM Research
Division, Yorktown Heights, NY 10598, 1999.
8. T. D. Gedeon. Data mining of inputs: analysing magnitude and functional measures.
Int. J. Neural Syst, 8:209–217, 1997.
9. C. Wu and S. Shivakumar. Back-propagation and counter-propagation neural
networks for phylogenetic classification of ribosomal RNA sequences. Nucleic Acids
Research, 22:4291–4299, 1994.
10. C. Wu, M. Berry, S. Shivakumar, and J. McLarty. Neural networks for full-scale
protein sequence classification: Sequence encoding with singular value decomposi-
tion. Machine Learning, 21:177–193, 1994.
11. L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regres-
sion Trees. Belmont, CA: Wadsworth, 1984.
12. J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
13. J.C. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier
for data mining. In Proc. of the 22nd Int'l Conference on Very Large Databases,
Bombay, India, 1996.
14. T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of
the IEEE, 78:1481–1497, 1990.
15. Roderick J. A. Little and Donald B. Rubin. Statistical analysis with missing data.
John Wiley & Sons, 1987.
16. A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data
via the EM algorithm. J. Roy. Statist. Soc. B, 39:1–38, 1977.
17. IBM Intelligent Miner for Data. Using the Intelligent Miner for Data. IBM Corp.,
Third Edition, 1998.
18. P. Cabena et al. Discovering data mining. Prentice Hall PTR, 1998.
19. X. Zhang and T. Narita. Discovering the primary factors of cancer from health
and living habit questionnaires. In S. Arikawa and K. Furukawa, editors, Discovery
Science: Second International Conference (DS’99). LNAI 1721 Springer, 1999.
20. Ashok N. Srivastava. Data mining for semiconductor yield forecasting. Future Fab
International, 1999.
Author Index