Customer Churn Prediction - A Case Study in Retail
Organized by:
Markus Ackermann,
Carlos Soares,
Bettina Guidemann
In Partnership with
SAS Deutschland, Heidelberg
Preface
General Description
The Workshop on Practical Data Mining is jointly organized by the Institute for Computer Science of the University of Leipzig, the Artificial Intelligence and Computer Science Laboratory of the University of Porto, and SAS Deutschland, with the goal of gathering researchers and practitioners to discuss relevant experiences and issues in the application of data mining technology in practice and to identify important challenges to be addressed.
We are sure that this workshop will provide a forum for the fruitful interaction
between participants from universities and companies, but we aim to go beyond
that! We hope that this workshop will become the starting point for practical
projects that involve people from the two communities. The future will tell if we
succeeded.
Motivation
Topics
Invited Talks
We are very pleased to have two invited talks at the workshop. The first one is by Stefan Wrobel of the Fraunhofer Institute for Autonomous Intelligent Systems on "Geo Intelligence: New Business Opportunities and Research Challenges in Spatial Mining and Business Intelligence". The second talk is by Ulrich Reincke of SAS Institute, who speaks about "Directions of Analytics, Data and Text Mining: A Software Vendor's View".
Paper Acceptance
There were 24 papers submitted to this workshop. Each paper was reviewed by
at least two reviewers. Based on the reviews, 10 papers were selected for oral
presentation at the workshop and 9 for poster presentation.
Acknowledgments
We wish to thank the Program Chairs and the Workshop Chairs of the ECML-
PKDD 2006 conference, in particular Tobias Scheffer, for their support.
We thank the members of the Program Committee for the timely and thor-
ough reviews and for the comments which we believe will be very useful to the
authors.
We are also grateful to DMReview and KMining for their help in reaching audiences which we otherwise would not be able to reach.
Thanks go also to SAS Deutschland, Heidelberg, for their support and partnership, and to our sponsors, Project Triana and Project Site-O-Matic, both of the University of Porto.
Organization
Program Committee
Alípio Jorge, University of Porto, Portugal
André Carvalho, University of São Paulo, Brazil
Arno Knobbe, Kiminkii/Utrecht University, The Netherlands
Carlos Soares, University of Porto, Portugal
Dietrich Wettschereck, Recommind, Germany
Dirk Arndt, DaimlerChrysler AG, Germany
Donato Malerba, University of Bari, Italy
Fátima Rodrigues, Polytechnical Institute of Porto, Portugal
Fernanda Gomes, BANIF, Portugal
Floriana Esposito, Università degli Studi di Bari, Italy
Gerhard Heyer, University of Leipzig, Germany
Gerhard Paaß, Fraunhofer Institute, Germany
Luboš Popelínský, Masaryk University, Czech Republic
Luc Dehaspe, PharmaDM
Manuel Filipe Santos, University of Minho, Portugal
Mário Fernandes, Portgás, Portugal
Marko Grobelnik, Josef Stefan Institute, Slovenia
Markus Ackermann, University of Leipzig, Germany
Mehmet Göker, PricewaterhouseCoopers, USA
Michael Berthold, University of Konstanz, Germany
Miguel Calejo, Declarativa/Universidade do Minho, Portugal
Mykola Pechenizkiy, University of Jyväskylä, Finland
Paula Brito, University of Porto, Portugal
Paulo Cortez, University of Minho, Portugal
Pavel Brazdil, University of Porto, Portugal
Peter van der Putten, Chordiant Software/Leiden University, The Netherlands
Petr Berka, University of Economics of Prague, Czech Republic
Pieter Adriaans, Robosail
Raul Domingos, SPSS, Portugal
Reza Nakhaeizadeh, DaimlerChrysler AG, Germany
Robert Engels, CognIT, Germany
Rüdiger Wirth, DaimlerChrysler AG, Germany
Rui Camacho, University of Porto, Portugal
Ruy Ramos, University of Porto, Portugal
Sascha Schulz, Humboldt University of Berlin, Germany
Stefano Ferilli, University of Bari, Italy
Steve Moyle, Secerno, United Kingdom
Teresa Godinho, Allianz Portugal, Portugal
Timm Euler, University of Dortmund, Germany
In Partnership with
SAS Deutschland, Heidelberg
Sponsors
Table of Contents
Invited Talks
Geo Intelligence: New Business Opportunities and Research Challenges
in Spatial Mining and Business Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Stefan Wrobel
Directions of Analytics, Data and Text Mining: A Software Vendor's View 2
Ulrich Reincke
CRM
Sequence Mining for Customer Behaviour Predictions in
Telecommunications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Frank Eichinger, Detlef D. Nauck, Frank Klawonn
Machine Learning for Network-based Marketing . . . . . . . . . . . . . . . . . . . . . . . 11
Shawndra Hill, Foster Provost, Chris Volinsky
Customer churn prediction - a case study in retail banking . . . . . . . . . . . . . 13
Teemu Mutanen, Jussi Ahola, Sami Nousiainen
Predictive Marketing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Jochen Werner
Customer churn prediction using knowledge extraction from emergent
structure maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Alfred Ultsch, Lutz Herrmann
Health Care and Medical Applications
Mining Medical Administrative Data: The PKB System . . . . . . . . . . . . . . 59
Aaron Ceglar, Richard Morrall, John F. Roddick
Combination of different text mining strategies to classify requests
about involuntary childlessness to an internet medical expert forum . . . . . 67
Wolfgang Himmel, Ulrich Reincke, Hans Wilhelm Michelmann
Industry
Data Mining Applications for Quality Analysis in Manufacturing . . . . . . . . 79
Roland Grund
Towards Better Understanding of Circulating Fluidized Bed Boilers:
Getting Domain Experts to Look at the Data Mining Perspective . . . . . . . 80
Mykola Pechenizkiy, Tommi Karkkainen, Andriy Ivannikov, Antti
Tourunen, Heidi Nevalainen
Tools
Bridging the gap between commercial and open-source data mining
tools: a case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Matjaz Kukar, Andrej Zega
Geo Intelligence: New Business Opportunities
and Research Challenges in Spatial Mining and
Business Intelligence
(Invited Talk)
Stefan Wrobel
Abstract
Every customer has an address, every store has a location, and traffic networks
are a decisive factor in accessibility and logistics. Even in classical business data
analysis, a large majority of data have a spatial component, and optimal business
decisions must take geographical context into account. In the talk, we will present
several examples of real world customer projects ranging from location selection
and geo-marketing to outdoor media. We will then move on to the new challenges
and opportunities brought about by the widespread availability of localisation
technology that allows tracking of people and objects in time and space.
Professor Dr. Stefan Wrobel is a professor of computer science at the University of Bonn and one of the three directors of the Fraunhofer Institute for Intelligent
Analysis and Information Systems IAIS (created in July 2006 as a merger of
Fraunhofer Institutes AIS and IMK).
Directions of Analytics, Data and Text Mining:
A Software Vendor's View
(Invited Talk)
Ulrich Reincke
Abstract
After the years of hype at the turn of the millennium, immediately followed by the crash of the dot-com bubble, data mining has become a mature market. New business applications are continuously developed, even for remote industries, and totally new data sources are becoming increasingly available to be explored with new data mining methods. The common type of data sources moved initially from numerical over time-stamped to categorical and text, while the latest challenges are geographic, biological and chemical information, which are of both text and numerical type, coupled with very complex geometric structures.
If you take a closer look at the concrete modelling options of both freeware and commercial data mining tools, there is very little difference between them. They all claim to provide their users with the latest analysis models that are consensus within the discussions of the research community. What makes a big difference, however, is the ability to map the data mining process into a continuous IT flow that controls the full information chain from the raw data through cleaning, aggregation and transformation, analytic modelling, operative scoring, and, last but not least, final deployment. This IT process needs to be set up so as to ensure that the original business question is solved and the resulting policy actions are applied appropriately in the real world. This ability constitutes a critical success factor in any data mining project of larger scale. Among other environmental parameters of a mining project, it depends mainly on clean and efficient metadata administration and the ability to cover and administer the whole project information flow with one software platform: data access, data integration, data mining, scoring and business intelligence. SAS is putting considerable effort into developing its data mining solutions in this direction. Examples of real-life projects will be given.
Sequence Mining for Customer Behaviour
Predictions in Telecommunications
1 Introduction
In the area of data mining, many approaches have been investigated and im-
plemented for predictions about customer behaviour, including neural networks,
decision trees and naïve Bayes classifiers (e.g., [1, 2, 3]). All these classifiers work
with static data. Temporal information, like the number of complaints in the last
year, can only be integrated by using aggregation. Temporal developments, like
a decreasing monthly billing amount, are lost after aggregation. In this paper,
sequence mining as a data mining approach that is sensitive to temporal devel-
opments is investigated for the prediction of customer events.
In Chapter 2 we present sequence mining and its adaptation to customer data in telecommunications, along with an extended tree data structure. In Chapter
3, a combined classification framework is proposed. Chapter 4 describes some
results with real customer data and Chapter 5 concludes this paper and points
out the lessons learned.
2 Sequence Mining
Sequence mining was originally introduced for market basket analysis [4] where
temporal relations between retail transactions are mined. Therefore, most se-
quence mining algorithms like AprioriAll [4], GSP [5] and SPADE [6] were de-
signed for mining frequent sequences of itemsets. In market basket analysis,
an itemset is the set of different products bought within one transaction. In
telecommunications, customer events do not occur together with other events.
Therefore, one has to deal with mining frequent event sequences, which is a
specialisation of itemset sequences. Following the Apriori principle [7], frequent
sequences are generated iteratively. A sequence of two events is generated from
frequent sequences consisting of one event and so on. After generating a new
candidate sequence, its support is checked in a database of customer histories.
The support is defined as the ratio of customers in a database who contain the
candidate sequence in their history.
customers correctly who contain a very significant sequence but with an extra
event in between. This extra event could be a simple call centre enquiry which
is not related to the other events in the sequence. The original definition would
allow many extra events occurring after a matched sequence. In the application
to customer behaviour prediction, a high number of more recent events after a
significant sequence might lower its impact. Therefore, we introduce two new
sequence mining parameters: maxGap, the maximum number of allowed extra
events in between a sequence and maxSkip, the maximum number of events at
the end of a sequence before the occurrence of the event to be predicted. With
these two parameters, it is possible to determine the support of a candidate se-
quence very flexibly and appropriately for customer behaviour predictions. For
instance, the presented example holds if maxGap = 2 and maxSkip = 3. It no longer holds if either of the parameters is decreased.
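As an illustration of how support could be counted under these two parameters, the sketch below checks candidate sequences against customer histories; the function names and data layout are assumptions made for this example, not the authors' implementation.

```python
def matches(history, candidate, max_gap, max_skip, start=0, last=None):
    """Return True if `candidate` occurs in `history` (a list of events, oldest
    first) with at most `max_gap` extra events between consecutive candidate
    events and at most `max_skip` events after the last matched event."""
    if not candidate:
        # all candidate events matched; check the tail constraint (maxSkip)
        return last is not None and (len(history) - 1 - last) <= max_skip
    for i in range(start, len(history)):
        if history[i] != candidate[0]:
            continue
        if last is not None and (i - last - 1) > max_gap:
            break  # any later match position would violate maxGap as well
        if matches(history, candidate[1:], max_gap, max_skip, i + 1, i):
            return True
    return False


def support(histories, candidate, max_gap, max_skip):
    """Fraction of customers whose history contains the candidate sequence."""
    hits = sum(matches(h, candidate, max_gap, max_skip) for h in histories)
    return hits / len(histories)
```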
Multiple database scans, which are necessary after every generation of candidate
sequences, are considered to be one of the main bottlenecks of Apriori-based
algorithms [8, 9]. Such expensive scans can be avoided by storing the database
of customer histories efficiently in main memory. In association rule mining,
tree structures are used frequently to store mining databases (e.g., [8]). In the
area of sequence mining, trees are not as attractive as lattice and bitmap data
structures (e.g., [6, 9]). This is due to smaller compressing effects in the presence
of itemsets. In our case, as well as in the application of sequence mining to
web log analysis (e.g., [10]) where frequent sequences of single events are mined,
tree structures seem to be an efficient data structure. In this paper, such a tree structure, or more precisely a trie memory4 [11] as known from string matching [12], is employed to store sequences in compressed form in main memory. We call our
data structure SequenceTree.
In the SequenceTree, every element of a sequence is represented as an inner or leaf node. The root node and all inner nodes contain maps of all direct successor nodes. Each child represents one possible extension of the prefix sequence defined by the parent node. The root node does not represent such an element; it just contains a map of all successors, which are the first elements of all sequences. Every node except the root node has an integer counter attached which indicates how many sequences end there.
An example for a SequenceTree containing five sequences is given in Figure
1. To retrieve the sequences from the tree, one can start at every node with a
counter greater than zero and follow the branch in the tree towards the root
node. Note that if the sequence hA B Ci is stored already, just a counter
needs to be increased if the same sequence is added again. If one wants to add
hA B C Di, the last node with the C becomes an inner node and a new
leaf node containing the event D with a count of one is added.
4 Tries are also called prefix trees or keyword trees.
Fig. 1. An example SequenceTree containing five sequences.
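A minimal sketch of such a trie-based structure, assuming events are hashable values; the class and method names are illustrative and not the authors' implementation.

```python
class SequenceTree:
    """Trie (prefix tree) over event sequences. Each node keeps a map of
    successor events and a counter of how many stored sequences end there."""

    def __init__(self):
        self.children = {}  # event -> child SequenceTree node
        self.count = 0      # number of sequences ending at this node

    def add(self, sequence):
        """Insert one customer event sequence, e.g. ('A', 'B', 'C')."""
        node = self
        for event in sequence:
            node = node.children.setdefault(event, SequenceTree())
        node.count += 1  # re-adding the same sequence only increments the counter

    def sequences(self, prefix=()):
        """Yield all stored sequences together with their counts."""
        if self.count:
            yield prefix, self.count
        for event, child in self.children.items():
            yield from child.sequences(prefix + (event,))


# Usage: adding ('A', 'B', 'C', 'D') after ('A', 'B', 'C') turns the C node
# into an inner node and attaches a new leaf for D with a count of one.
tree = SequenceTree()
tree.add(('A', 'B', 'C'))
tree.add(('A', 'B', 'C', 'D'))
```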
2.3 Sequence Mining Using the Sequence Tree
event. The result was a set of sequential association rules like the following one: ⟨ENQUIRY → ENQUIRY → REPAIR⟩, confidence = 4.4%, support = 1.2%, meaning that 1.2% of all customers display the pattern with a specific repair first, then an enquiry followed by another enquiry in their event history. 4.4% of all customers with this pattern have a churn event afterwards. Therefore, we know that customers displaying this pattern have a slightly higher churn probability than the average of all customers. On the other hand, a support of 1.2% means that just a small fraction of all customers is affected. This rule is just one rule in a set of around a hundred rules (depending on the predefined minimum support). Only some rules exist with a higher confidence of around 10%, but they affect even smaller fractions of customers. Even if such a rule with a high confidence of e.g. 10% is used to identify churners, this rule would still classify 90% of the customers with this pattern incorrectly. Therefore, even a large set of sequential association rules was not suitable for churn prediction.
4 Experimental Results
The combined classifier was applied to real customer data from a major European
telecommunication provider. In this paper, just some results from a churn predic-
tion scenario are presented, even though the model was tested in a number of different
scenarios for different events. For reasons of data protection, non-representative
random test samples with a predefined churn rate had to be generated. In a three-month churn prediction scenario, the combined classifier was first trained with historic data including all events within one year and then applied to a test set from a more recent time window. The classifier found 19.4% of all churners with a false positive rate of only 2.6%. The gain of this test result, i.e. the ratio by which the classifier outperforms a random classifier, is 5.5.

Fig. 2. Overview of the model building process (preprocessing, sequence mining with a minimum support, sequential association rules, decision tree learning and pruning for every rule, resulting in a classification model of sequences with decision trees) and of the classification process.
It is hard to compare our test results. On the one hand, all results related to customer data are usually confidential and therefore they are not published. On
the other hand, published results are hardly comparable due to differences in
data and test scenarios. Furthermore, most published results were achieved by
applying the predictive model to a test set from the same time window (e.g., [3])
instead of making future predictions.
mining sequences of single events. We showed that our tree structure in combi-
nation with hashing techniques is very efficient.
Our investigations showed that sequence mining alone is not suitable for
making valuable predictions about the behaviour of customers based on typically
rare events like churn. However, it is capable of discovering potentially interesting
relationships concerning the occurrence of events.
Furthermore, our study showed that it is more promising to analyse temporal
developments by employing sequence mining in combination with other classifiers
than to use only static classification approaches.
References
[1] Yan, L., Miller, D.J., Mozer, M.C., Wolniewicz, R.: Improving Prediction of Customer Behavior in Nonstationary Environments. In: Proc. International Joint Conference on Neural Networks (IJCNN) (2001)
[2] Buckinx, W., Baesens, B., Van den Poel, D., Van Kenhove, P., Vanthienen, J.: Using Machine Learning Techniques to Predict Defection of Top Clients. In: Proc. 3rd International Conference on Data Mining Methods and Databases (2002) 509-517
[3] Neslin, S.A., Gupta, S., Kamakura, W., Lu, J., Mason, C.H.: Defection Detection: Measuring and Understanding the Predictive Accuracy of Customer Churn Models. Journal of Marketing Research 43(2) (2006) 204-211
[4] Agrawal, R., Srikant, R.: Mining Sequential Patterns. In: Proc. 11th International Conference on Data Engineering (ICDE) (1995) 3-14
[5] Srikant, R., Agrawal, R.: Mining Sequential Patterns: Generalizations and Performance Improvements. In: Proc. 5th International Conference on Extending Database Technology (EDBT) (1996) 3-17
[6] Zaki, M.J.: SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning 42(1-2) (2001) 31-60
[7] Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proc. 20th International Conference on Very Large Data Bases (VLDB) (1994) 487-499
[8] Han, J., Pei, J., Yin, Y.: Mining Frequent Patterns without Candidate Generation. In: Proc. ACM SIGMOD International Conference on Management of Data (SIGMOD) (2000) 1-12
[9] Savary, L., Zeitouni, K.: Indexed Bit Map (IBM) for Mining Frequent Sequences. In: Proc. 9th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) (2005) 659-666
[10] El-Sayed, M., Ruiz, C., Rundensteiner, E.A.: FS-Miner: Efficient and Incremental Mining of Frequent Sequence Patterns in Web Logs. In: Proc. 6th ACM Workshop on Web Information and Data Management (WIDM) (2004) 128-135
[11] de la Briandais, R.: File Searching Using Variable Length Keys. In: Proc. Western Joint Computer Conference (1959) 295-298
[12] Aho, A.V., Corasick, M.J.: Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM 18(6) (1975) 333-340
[13] Aho, A.V., Hopcroft, J.E., Ullman, J.D.: Data Structures and Algorithms. Series in Computer Science and Information Processing. Addison-Wesley (1982)
[14] Ferreira, P.G., Azevedo, P.J.: Protein Sequence Classification Through Relevant Sequence Mining and Bayes Classifiers. In: Proc. 12th Portuguese Conference on Artificial Intelligence (EPIA) (2005) 236-247
Machine Learning for Network-based Marketing
Abstract
Traditionally data mining and statistical modeling have been conducted assum-
ing that data points are independent of one another. In fact, businesses increas-
ingly are realizing that data points are fundamentally and inextricably inter-
connected. Consumers interact with other consumers. Documents link to other
documents. Criminals committing fraud interact with each other.
Modeling for decision-making can capitalize on such interconnections. In our
paper appearing in the May issue of Statistical Science (Hill et al. 2006), we
provide strong evidence that network-based marketing can be immensely more
effective than traditional targeted marketing. With network-based marketing,
consumer targeting models take into account links among consumers. We con-
centrate on the consumer networks formed using direct interactions (e.g., com-
munications) between consumers. Because of inadequate data, prior studies have
not been able to provide direct, statistical support for the hypothesis that net-
work linkage can directly affect product/service adoption. Using a new data set
representing the adoption of a new telecommunications service, we show very
strong support for the hypothesis. Specifically, we show three main results:
1) Network neighbors, those consumers linked to a prior customer, adopt the service at a rate 3-5 times greater than baseline groups selected by the best practices of the firm's marketing team. In addition, analyzing the network allows the firm to acquire new customers who otherwise would have fallen through the cracks, because they would not have been identified by models learned using only traditional attributes.
2) Statistical models, learned using a very large number of geographic, de-
mographic, and prior purchase attributes, are significantly and substantially im-
proved by including network-based attributes. In the simplest case, including an indicator of whether or not each consumer has communicated with an existing customer improves the learned targeting models significantly (a sketch of such an indicator is given below, after the third result).
3) More detailed network information allows the ranking of the network-
neighbors so as to permit the selection of small sets of individuals with very high
probabilities of adoption. We include graph-based and social-network features,
as well as features that quantify (weight) the network relationships (e.g., the
amount of communication between two consumers).
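As a rough illustration of the simplest network-based attribute mentioned in result 2, the sketch below derives a "has communicated with an existing customer" indicator from a list of call records; the field layout and function name are assumptions made for this example, not the authors' data or code.

```python
from collections import defaultdict


def network_neighbor_flags(calls, customers):
    """calls: iterable of (caller_id, callee_id) pairs from interaction logs.
    customers: set of ids that are already customers of the service.
    Returns a dict mapping each consumer id to 1 if they communicated with
    an existing customer, else 0 (the 'network neighbor' indicator)."""
    contacts = defaultdict(set)
    for a, b in calls:
        contacts[a].add(b)
        contacts[b].add(a)
    return {consumer: int(bool(neighbors & customers))
            for consumer, neighbors in contacts.items()}


# The resulting 0/1 flag can be appended to the demographic and prior-purchase
# attributes before training a targeting model.
flags = network_neighbor_flags([("c1", "c2"), ("c2", "c3")], customers={"c3"})
```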
In general, network-based marketing can be undertaken by initiating vi-
ral marketing, where customers are given incentives to propagate information
themselves. Alternatively, network-based marketing can be performed directly
by any business that has data on interactions between consumers, such as phone
calls or email. Telecommunications firms are obvious candidates.
These results also provide an interesting perspective on recent initiations and
acquisitions of network communication services by non-telecom companies, for
example, Gmail for Google, Skype for eBay. Would targeting the social neighbors
of consumers who respond favorably to ads help Google? Clearly that depends
on the type of product. In addition, the results emphasize the intrinsic business
value of electronic community systems that provide explicit linkages between
acquaintances, such as MySpace, Friendster, Facebook, etc.
Based on: S. Hill, F. Provost, and C. Volinsky. Network-based marketing:
Identifying likely adopters via consumer networks. Statistical Science 21(2),
2006.
Customer churn prediction - a case study in
retail banking
1 Introduction
This paper presents a customer churn analysis in the consumer retail banking sector. The focus of customer churn analysis is to determine the customers who are at risk of leaving and, if possible, to analyse whether those customers are worth retaining. A company will therefore have a sense of how much is really being lost because of customer churn and of the scale of the efforts that would be appropriate for a retention campaign.
Customer churn is closely related to the customer retention rate and loyalty. Hwang et al. [8] describe customer defection as the hottest issue in the highly competitive wireless telecom industry. Their LTV model suggests that the churn rate of a customer has a strong impact on the LTV because it affects the length of service and the future revenue. Hwang et al. also define customer loyalty as an index of how much customers would like to stay with the company. Churn describes the number or percentage of regular customers who abandon their relationship with the service provider [8].
Table 1. Examples of churn prediction in literature.
possible churn. This enables the marketing department, given the limited resources, to contact the high-probability churners first [1].
Lester explains the segmentation approach in customer churn analysis [11]. She also points out the importance of choosing the right characteristics to study in the customer churn analysis. For example, in the banking context the signals studied might include a decreasing account balance or a decreasing number of credit card purchases. A similar type of descriptive analysis has been conducted by Keaveney et al. [9], who studied customer switching behavior in online services based on questionnaires sent out to customers. Garland has done research on customer profitability in personal retail banking [7]. Although his main focus is on the customers' value to the studied bank, he also investigates the duration and age of the customer relationship in relation to profitability. His study is based on a customer survey by mail, which helped him determine the customers' share of wallet, satisfaction and loyalty from qualitative factors.
Table 1 presents examples of the churn prediction studies found in the literature: analyses of churning customers have been conducted in various fields. However, to the best of our understanding, no practical studies have been published for the retail banking sector that focus on the difference between continuers and churners.
2 Case study
The consumer retail banking sector is characterized by customers who stay with a company for a very long time. Customers usually give their financial business to one company and do not switch their financial services provider very often. From the company's perspective this produces a stable environment for customer relationship management. Despite the continuity of the customer relationships, the potential loss of revenue caused by customer churn can in this case be huge. The mass marketing approach cannot succeed in the diversity of consumer business today. Customer value analysis along with customer churn predictions will help marketing programs target more specific groups of customers.
In this study a customer database from a Finnish bank was used and analyzed. The data consisted only of personal customers. The data at hand were collected from the time period December 2002 until September 2005. The sampling interval was three months, so for this study we had relevant data for 12 points in time [t(0)-t(11)]. In the logistic regression analysis we used a sample of 151 000 customers.
In total, 75 variables were collected from the customer database. These vari-
ables are related to the topics as follows: (1) account transactions IN, (2) account
transactions OUT, (3) service indicators, (4) personal profile information, and
(5) customer level combined information.
The data had 30 service indicators in total (e.g. a 0/1 indicator for a housing loan). One of these indicators, C1, tells whether the customer has a current account in the time period at hand, and the definition of churn in the case study is based on it. This simple definition is adequate for the study and makes it easy to detect the exact moment of churn. Customers without a C1 indicator before the time period were not included in the analysis; their volume in the dataset is small. In the banking sector a customer who does leave may leave an active customer id behind, because bank record formats are dictated by legislative requirements.
This problem has been identified in the literature under the term class imbalance problem [10]; it occurs when one class is represented by a large number of examples while the other is represented by only a few. The problem is particularly crucial in an application, such as the present one, where the goal is to maximize recognition of the minority class [4]. In this study a down-sizing method was used to avoid all predictions turning out as non-churners. The down-sizing (under-sampling) method consists of randomly removing samples from the majority class population until the minority class becomes a specific percentage of the majority class [3]. We used this procedure to produce two different datasets for each time step: one with a churner/nonchurner ratio of 1/1 and the other with a ratio of 2/3.
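A minimal sketch of this down-sizing step, assuming a pandas DataFrame with a binary churn column; the column name, helper function and random seed are illustrative, not the authors' code.

```python
import pandas as pd


def downsize(df, label_col="churn", ratio=1.0, seed=42):
    """Randomly remove majority-class rows until the churner/non-churner ratio
    equals `ratio` (e.g. 1.0 for the 1/1 dataset, 2/3 for the second one)."""
    minority = df[df[label_col] == 1]
    majority = df[df[label_col] == 0]
    n_majority = int(round(len(minority) / ratio))
    majority_sampled = majority.sample(n=n_majority, random_state=seed)
    # concatenate and shuffle the balanced training set
    return pd.concat([minority, majority_sampled]).sample(frac=1, random_state=seed)


# balanced_1_1 = downsize(train_df, ratio=1.0)
# balanced_2_3 = downsize(train_df, ratio=2 / 3)
```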
In this study we use binary predictions, churn and no churn. A logistic regression method [5] was used to formulate the predictions. The logistic regression model generates a value between 0 and 1 based on the estimated model. The predictive performance of the models was evaluated by using lift curves and by counting the number of correct predictions.
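As a sketch of this modelling step, assuming scikit-learn and stand-in random data in place of the predictor variables of Table 2, a model could be fitted and thresholded at 0.5 as follows; this is an illustration, not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-ins for the predictors (customer age, bank age, transaction counts in
# t=i and t=i-1, number of services) and the binary churn label.
X_train = rng.normal(size=(1000, 6))
y_train = rng.integers(0, 2, 1000)
X_valid = rng.normal(size=(200, 6))

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The fitted model outputs a churn probability between 0 and 1; thresholding
# at 0.5 yields the binary churn / no-churn prediction used for Table 3.
churn_prob = model.predict_proba(X_valid)[:, 1]
churn_pred = (churn_prob >= 0.5).astype(int)
```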
3 Results
A collection of six different regression models was estimated and validated. Mod-
els were estimated by using six different training sets: three time periods (4, 6,
and 8) with two datasets each. Three time periods (t = 4, 6, 8) were selected for
the logistic regression analysis. This produced six regression models which were
validated by using data sample 3 (115 000 customers with the current account
indicator). In the models we used several independent variables; the variables for each model are presented in Table 2. The number of correct predictions for each model is presented in Table 3. In the validation we used the same sample with the churners before time period t=9 removed, and the data for validation was collected from time periods t(9)-t(11).
Table 2. Predictive variables used in each of the logistic regression models. The subscript 1 (e.g. model 4_1) marks a model trained on the dataset with a churner/nonchurner ratio of 1/1 and the subscript 2 a model trained on the dataset with a ratio of 2/3. The coefficients of the variables in each model are given in the table.

Model                                        4_1     4_2     6_1     6_2     8_1     8_2
Constant                                       -       -     0.663     -     0.417     -
Customer age                                 0.023   0.012   0.008   0.015   0.015   0.013
Customer bank age                           -0.018  -0.013  -0.017  -0.014  -0.013  -0.014
Vol. of (phone) payments in t=i-1              -       -       -       -     0.000   0.000
Num. of transactions (ATM) in t=i-1          0.037   0.054     -       -     0.053   0.062
Num. of transactions (ATM) in t=i           -0.059  -0.071     -       -    -0.069  -0.085
Num. of transactions (card payments) t=i-1   0.011   0.013     -     0.016   0.020   0.021
Num. of transactions (card payments) t=i    -0.014  -0.017     -    -0.017  -0.027  -0.026
Num. of transactions (direct debit) t=i-1    0.296   0.243   0.439   0.395     -       -
Num. of transactions (direct debit) t=i     -0.408  -0.335  -0.352  -0.409     -       -
Num. of services (not current account)      -1.178  -1.197  -1.323  -1.297  -0.393  -0.391
Salary on logarithmic scale in t=i           0.075   0.054     -       -       -       -
Although all the variables in each of the models presented in Table 2 were significant, there could still be correlation between the variables. For example, in this study the variables Num. of transactions (ATM) in t=i and t=i-1 are correlated to some degree because they represent the same quantity measured in different time periods. The problem that arises when two or more explanatory variables are correlated with each other is known as multicollinearity. Multicollinearity does not change the estimates of the coefficients, only their reliability, so the interpretation of the coefficients becomes quite difficult [13]. One indicator of multicollinearity is high standard errors combined with low significance statistics. A number of formal tests for multicollinearity have been proposed over the years, but none has found widespread acceptance [13].
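As one illustration of such a diagnostic (not used in the paper; added here only as an example), the variance inflation factor of each predictor could be computed as follows:

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def vif(X):
    """Variance inflation factor for each column of the 2-D array X:
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the
    remaining columns. Values well above roughly 5-10 suggest multicollinearity."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return vifs
```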
The lift curve helps to analyze the share of true churners that are identified in each subset of customers. Figure 1 presents the percentage of identified churners for each of the logistic regression models; the lift curves were calculated from the validation-set performance. In Table 3 the models 4_1, 6_1, and 6_2 have close to 60% correct predictions, whereas models 4_2 and 8_2 have above 70% correct predictions. This difference between the five models vanishes when the number of correct predictions is analyzed in the subsets, as presented in Figure 1.

Fig. 1. Lift curves from the validation-set (t=9) performance of six logistic regression models. The model number (4, 6, and 8) represents the time period of the training set and the subscript (1 and 2) the down-sizing ratio.

Table 3. Number and % share of correct predictions (mean over the time periods t=9, 10, 11). The validation sample contained 111 861 cases. The results were produced by the models with a threshold value of 0.5.
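A lift curve of this kind can be computed from predicted churn probabilities roughly as follows; this is a sketch assuming numpy arrays of scores and true labels, not the authors' implementation.

```python
import numpy as np


def lift_curve(y_true, churn_prob, steps=20):
    """For the top 5%, 10%, ... of customers ranked by predicted churn
    probability, return the fraction of all true churners captured."""
    order = np.argsort(-np.asarray(churn_prob))   # highest churn scores first
    sorted_labels = np.asarray(y_true)[order]
    total_churners = sorted_labels.sum()
    fractions, captured = [], []
    for k in range(1, steps + 1):
        cutoff = int(len(sorted_labels) * k / steps)
        fractions.append(k / steps)
        captured.append(sorted_labels[:cutoff].sum() / total_churners)
    return fractions, captured
```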
4 Conclusions
In this paper a customer churn analysis in the consumer retail banking sector was presented. The different churn prediction models predicted the actual churners relatively well. The findings of this study indicate that, in the case of the logistic regression model, the user should keep updating the model to be able to produce predictions with high accuracy, since the independent variables of the models vary. The customer profiles of the predicted churners were not included in the study.
From a company's perspective it is interesting whether the churning customers are worth retaining or not, and from a marketing perspective what can be done to retain them. Is a three-month prediction horizon enough to make a positive impact so that the customer is retained, or should the prediction be made, for example, six months ahead?
The customer churn analysis in this study might not be interesting if the customers are valued based on customer lifetime value. The churn definition in this study was based on the current account. If the churn definition was instead based on, for example, a loyalty program account or active use of the internet service, then the customers in focus could possibly have a greater lifetime value, and it would thus be more important to retain them.
References
1. Au W., Chan C.C., Yao X.: A novel evolutionary data mining algorithm with applications to churn prediction. IEEE Transactions on Evolutionary Computation 7 (2003) 532-545
2. Buckinx W., Van den Poel D.: Customer base analysis: partial detection of behaviorally loyal clients in a non-contractual FMCG retail setting. European Journal of Operational Research 164 (2005) 252-268
3. Chawla N., Bowyer K., Hall L., Kegelmeyer P.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002) 321-357
4. Cohen G., Hilario M., Sax H., Hugonnet S., Geissbuhler A.: Learning from imbalanced data in surveillance of nosocomial infection. Artificial Intelligence in Medicine 37 (2006) 7-18
5. Cramer J.S.: The Logit Model: An Introduction. Edward Arnold (1991). ISBN 0-304-54111-3
6. Ferreira J., Vellasco M., Pacheco M., Barbosa C.: Data mining techniques on the evaluation of wireless churn. In: ESANN 2004 Proceedings, European Symposium on Artificial Neural Networks, Bruges (2004) 483-488
7. Garland R.: Investigating indicators of customer profitability in personal retail banking. In: Proc. of the Third Annual Hawaii Int. Conf. on Business (2003) 18-21
8. Hwang H., Jung T., Suh E.: An LTV model and customer segmentation based on customer value: a case study on the wireless telecommunication industry. Expert Systems with Applications 26 (2004) 181-188
9. Keaveney S., Parthasarathy M.: Customer Switching Behaviour in Online Services: An Exploratory Study of the Role of Selected Attitudinal, Behavioral, and Demographic Factors. Journal of the Academy of Marketing Science 29 (2001) 374-390
10. Japkowicz N., Stephen S.: The class imbalance problem: A systematic study. Intelligent Data Analysis 6 (2002) 429-449
11. Lester L.: Read the Signals. Target Marketing 28 (2005) 45-47
12. Mozer M.C., Wolniewicz R., Grimes D.B., Johnson E., Kaushansky H.: Predicting Subscriber Dissatisfaction and Improving Retention in the Wireless Telecommunications Industry. IEEE Transactions on Neural Networks (2000)
13. Pindyck R., Rubinfeld D.: Econometric Models and Economic Forecasts. Irwin/McGraw-Hill (1998). ISBN 0-07-118831-2
14. Rosset S., Neumann E., Eick U., Vatnik N., Idan Y.: Customer lifetime value modeling and its use for customer retention planning. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Canada (2002) 332-340
Predictive Marketing
Cross-campaign marketing optimization and real-time recommendations in inbound channels
Jochen Werner
Traditional marketing usually focuses on the product or on individual campaigns. Data mining is already frequently used there to achieve quite a high degree of optimization of individual campaigns. However, there is usually no overarching optimization of several campaigns over time, nor are inbound channels such as the call center and the web used for targeted campaigns. Yet this is exactly where a very high optimization potential still lies, which SPSS addresses with Predictive Marketing.
2. Cross-Campaign Optimization
2.4. Summary
Cross-campaign optimization builds on existing predictive models and should be seen as a complement to an existing campaign management system. Significantly higher response rates can be achieved with an unchanged campaign volume.
2.5. Automated Selection of the Right Channel and the Right Time
The channel through which a customer receives an offer can be just as important as the offer itself. One customer may respond more readily to an e-mail, while another immediately hits the delete key but would accept the same offer on the spot in a store. In cross-campaign optimization, the best channel is determined for each customer and each campaign in order to increase the probability of a positive response. Cross-channel optimization prevents a customer from receiving several offers. If a channel reaches its limits, the campaign is completed on an alternative channel.
In addition to cross-channel optimization, internal rules for customer contact as well as internal and external constraints (for example, official exclusion lists) must be observed. This ensures that customers are neither flooded with offers nor contacted through unwanted channels.
Likewise, the right timing has a decisive influence on the success of an action. An intelligently timed campaign can prevent a switch to a competitor or enable the sale of a higher-value product (up-sell business). Changes in customer behaviour, such as a declining number of purchases or repeatedly exceeding the monthly minute allowance, are important signals. The system automatically searches for changes in customer behaviour (called events) that indicate unmet demand or a potential loss of value, and selects the best campaign for the situation. In this way customers promptly receive offers that are tailored to their specific needs.
3. Using Inbound Channels for Targeted Marketing Campaigns
Today, predictive models are used primarily for outbound measures such as campaign optimization and customer retention activities. The main reason for this is the lack of timely availability of analyses and predictions in the classic inbound areas such as the call center or the website. With real-time solutions (e.g. from SPSS), recommended actions and offers can be provided to call center agents or to visitors of the company website. These channels can then be used for targeted cross- and up-selling, customer retention, fraud detection and the like, and significant revenue increases or cost reductions can be achieved.
3.1. Call Center: from Cost Center to Profit Center
In real-time analysis, incoming callers are identified immediately and scored in real time using additional qualification questions asked at that moment (e.g. whether the call concerns a product enquiry, a complaint, a request for information, etc.). This means that, as the result of the analysis running in real time, the caller is assigned to a customer segment on the basis of his historical and current information. The system then displays a recommended action on the call center agent's screen that fits this person and the current situation. This can be used, for example, for product recommendations, cross-selling, customer retention measures, etc. The essential point is that the caller does not receive just any offer, but one tailored to him and the situation at hand, to which he is very likely to respond positively.
Beyond the highest response probability, it may also make sense to include business rules as a decision criterion. The example shown in the figure above uses the business value as the decision criterion, computed from the contribution margin and the response probability. Because of the stored preference for the highest business value, the system chooses action C: its response probability is lower than that of action B, but its contribution margin is more than twice as high, so this action is more worthwhile than campaign B.
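The decision rule described above, choosing the action with the highest business value (contribution margin times response probability), can be sketched as follows; the numbers and field names are illustrative only and are not taken from the figure mentioned in the text.

```python
def best_action(candidates):
    """Pick the offer with the highest expected business value.
    Each candidate is (name, response_probability, contribution_margin)."""
    scored = [(name, prob * margin) for name, prob, margin in candidates]
    return max(scored, key=lambda pair: pair[1])


# Illustrative values: action B has the higher response probability, but the
# much larger contribution margin of action C gives it the higher business value.
actions = [("B", 0.30, 40.0), ("C", 0.22, 90.0)]
print(best_action(actions))  # action C wins with a business value of about 19.8
```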
4. Conclusion
To achieve further efficiency gains in campaign optimization and to avoid the over-mailing that is increasingly getting out of hand, marketing is called upon to think about new approaches. The possibilities described here allow companies to achieve significantly higher response rates, to use inbound channels effectively for campaigns, and thereby to gain increased revenues and competitive advantages.
Customer churn prediction using knowledge extraction from emergent structure maps

Alfred Ultsch, Lutz Herrmann
Online Ensembles for Financial Trading
1 Introduction
Financial markets are highly dynamic and complex systems that have long been
the object of different modelling approaches with the goal of trying to anticipate
their future behavior so that trading actions can be carried out in a profitable
way. The complexity of the task, as well as some theoretical research, has led
many to consider it an impossible task. Without entering this never-ending
argument, in this paper we try to experimentally verify a hypothesis based
on empirical evidence that confirms the difficulty of the modelling task. In effect,
we have tried many different modelling approaches on several financial time
series and we have observed strong fluctuations in the predictive performance of
these models. Figure 1 illustrates this observation by showing the performance
of a set of models on the task of predicting the 1-day future returns of the
Microsoft stock. As we can observe, the ranking of the models, according to the
performance measure used (NMSE measured on the previous 15 days), varies a lot,
Fig. 1. The performance of different models on a particular financial time series.
which shows that at any given time t, the model that is predicting best can be
different.
Similar patterns of performance were observed on other time series and
using several other experimental setups. In effect, we have carried out a very large
set of experiments varying: the modelling techniques (regression trees, support
vector machines, random forests, etc.); the input variables used in the models
(only lagged values of the returns, technical indicators, etc.); the way the past
data was used (different sliding windows, growing windows, etc.). Still, the goal of
this paper is not the selection of the best modelling approach. The starting point
of this work are the predictions of this large set of approaches that are regarded
(from the perspective of this paper) as black boxes. Provided these models behave
differently (i.e. we can observe effects like those shown in Figure 1), we have a
good setup for experimentally testing our hypothesis.
Our working hypothesis is that through the combination of the different
predictions we can overcome the limitations that some models show at some
stages. In order to achieve this, we claim that the combination should have
dynamic weights so that it is adaptable to the current ranking of the models.
This means that we will use weights that are a function of the recent past
performance of the respective models. This way we obtain some sort of dynamic
online ensembles, that keep adjusting the ensemble to the current pattern of
performance exhibited by the individual models.
There are a few assumptions behind this hypothesis. First of all, we are as-
suming that we can devise a good characterization of the recent past performance
of the models and more over that this statistic is a good indicator (i.e. can serve
as a proxy) of the near future performance of the models. Secondly, we must
assume that there is always diversity among the models in terms of the perfor-
mance at any time t. Thirdly, we must also assume that there will always be
some model performing well at any time t, otherwise the resulting combination
could not perform well either.
Obviously, the above assumptions are what we could consider an ideal setup
for the hypothesis to be verified in practice. Several of these assumptions are
difficult to meet in a complex real-world problem like financial trading, let
alone proving that they always hold. The work we present here can be regarded
as a first attempt to experimentally test our hypothesis. Namely, we propose a
statistic for describing the past performance of the models and then evaluate an
online ensemble based on a weighting scheme that is a function of this statistic
on a set of real world financial time series.
where F.15k is the value of the F-measure calculated using the predictions of
model k for the time window [t-15..t].
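As an illustration of this weighting scheme, the following sketch (not the authors' actual implementation; the names, the signal encoding, the macro-averaged F-measure and the use of scikit-learn are assumptions) computes the F-measure of each model over the last 15 days and uses the normalized values as weights in a simple voting ensemble:

# Sketch of a dynamic online ensemble weighted by each model's F-measure
# over a 15-day sliding window. Class labels and names are illustrative.
from collections import Counter
from sklearn.metrics import f1_score

WINDOW = 15

def recent_f_measures(history, window=WINDOW):
    # history: list of (true_signal, {model_name: predicted_signal}) per day
    recent = history[-window:]
    y_true = [t for t, _ in recent]
    scores = {}
    for model in recent[0][1]:
        y_pred = [preds[model] for _, preds in recent]
        scores[model] = f1_score(y_true, y_pred, average="macro", zero_division=0)
    return scores

def ensemble_prediction(current_preds, f_scores):
    # weighted vote: each model votes with weight proportional to its recent F-measure
    total = sum(f_scores.values()) or 1.0
    votes = Counter()
    for model, signal in current_preds.items():
        votes[signal] += f_scores.get(model, 0.0) / total
    return votes.most_common(1)[0][0]

In this sketch the weights are simply the normalized recent F-measures; any monotone transformation of the same statistic would fit the scheme equally well.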
Table 1. The results for the DELL stock.
1% 1.5% 2%
Prec Rec Prec Rec Prec Rec
lm.stand 0.405 0.077 0.472 0.017 0.364 0.006
lm.soph 0.453 0.173 0.460 0.045 0.519 0.014
randomForest.stand 0.434 0.227 0.402 0.080 0.328 0.026
randomForest.soph 0.420 0.321 0.399 0.154 0.345 0.075
cart.stand 0.444 0.004 0.444 0.004 0.600 0.003
cart.soph 0.231 0.004 0.154 0.004 0.154 0.004
nnet.stand 0.388 0.246 0.353 0.128 0.289 0.075
nnet.soph 0.453 0.063 0.360 0.037 0.323 0.030
svm.stand 0.391 0.380 0.357 0.190 0.326 0.075
svm.soph 0.415 0.360 0.397 0.189 0.373 0.097
Ensemble 0.438 0.451 0.354 0.287 0.296 0.210
The gains of the ensemble in terms of Recall get more impressive as we increase the thresholds, i.e. make
the problem even harder because increasing the thresholds means that trading
signals are even rarer and thus harder to capture by models that are not
designed to be accurate at predicting rare values. Still, the Precision of the
ensemble signals also suffers in these cases, which means that the resulting
signals could hardly be considered for real trading. Nevertheless, the results
with the thresholds set to 1% can be considered worth exploring in terms of
trading, as we get both Precision and Recall around 45%. The best individual
model has a similar Precision but only 17.3% Recall.
The same general pattern of results was observed in the experiments with
other stocks. In summary, this first set of experiments provides good support
for the hypothesis of using indicators of the recent performance of models
as dynamic weights in an online ensemble in the context of financial markets.
4 Conclusions
In this paper we have explored the possibility of using online ensembles to im-
prove the performance of models in a financial trading application. Our proposal
was motivated by the observation that the performance of a large set of indi-
vidual models varied a lot over the testing period we used. Based on this
empirical observation we have proposed to use dynamic (calculated on moving
windows) statistics of the performance of individual models as weights in an
online ensemble.
The application we are addressing has some particularities, namely the in-
creased interest in accurately predicting rare extreme values of the stock returns.
This fact led us to transform the initial numerical prediction problem into a
discretized version where we could focus our attention on the classes of interest
(the high and low returns that lead to trading actions). As a follow up we have
decided to use Precision and Recall to measure the performance of the models
at accurately predicting these trading opportunities. Moreover, as the statistical in-
dicator of recent past performance we have used the F-measure, which combines
these two statistics.
The results of our initial approach to dynamic online ensembles in the context
of financial trading are promising. We have observed an increased capacity of the
ensembles in terms of signaling the trading opportunities. Moreover, this result
was achieved without compromising the accuracy of these signals. This means
that the resulting ensembles are able to capture many more trading opportunities
than the individual models.
The main lessons learned from this application are:
Predicting extreme values is a completely different problem: we have con-
firmed our previous observations [3] that the problem of predicting rare ex-
treme values is very different from the standard regression setup, and this
demands specific measures of predictive performance.
Dynamic online ensembles are a good means to fight instability of individual
models: whenever the characteristics of a problem lead to a certain instability
in model performance, the use of dynamic online ensembles is a good
approach to explore the advantages of the best models at each time step.
References
1. J. Barbosa. Métodos para lidar com mudanças de regime em séries temporais finan-
ceiras - utilização de modelos múltiplos na combinação de previsões (in Portuguese).
Master's thesis, Faculty of Economics, University of Porto, 2006.
2. C. Van Rijsbergen. Information Retrieval. Dept. of Computer Science, University
of Glasgow, 2nd edition, 1979.
3. L. Torgo and R. Ribeiro. Predicting rare extreme values. In W. Ng, editor, Proceed-
ings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining
(PAKDD2006), number 3918 in Lecture Notes in Artificial Intelligence. Springer,
2006.
Bayesian Network Classifiers for Country Risk
Forecasting
Ernesto Coutinho Colla1 , Jaime Shinsuke Ide1 , and Fabio Gagliardi Cozman2
1
Escola de Economia de São Paulo, Fundação Getulio Vargas
Rua Itapeva, 474, 12th floor - CEP 01332-000, São Paulo, SP - Brazil
2
Escola Politécnica, Universidade de São Paulo
Av. Prof. Mello Moraes, 2231 - CEP 05508-900, São Paulo, SP - Brazil
1 Introduction
The forecast of inflation, employment, economic activity and interest rates (that
is, of macroeconomic variables) is extremely important for corporate
investment decisions, government policy and family consumption. Decision sup-
port tools that can handle such variables and construct economic forecasts are
clearly useful.
Country risk ratings are important macroeconomic indicators as they sum-
marize the perceptions of economic agents about economic and political stability
[6]. This perception affects the country's direct investment flows and interna-
tional loans [16], thus impacting its domestic economic activity.
Our goal in this work is to forecast the daily behavior of country risk ratings using
pattern recognition from economic and financial variables. We believe that this
forecasting is very useful for building corporate, government and financial decision
support tools. As the risk of a country cannot be directly measured, we have
chosen to forecast one of the most widely adopted indicators of country risk, the
Emerging Markets Bond Index Plus (EMBI+). This indicator is provided by
J.P. Morgan and tracks total returns for traded external debt instruments in
emerging markets. We use data about Brazil.
We employ Bayesian networks and their machine learning algorithms [10, 14]
for economic forecasting based on pattern recognition. The present paper is a
preliminary report on the promising (and challenging) use of Bayesian networks
to rate country risks. We should note that other machine learning tools have been
investigated in the literature on economics. For example, neural networks have
been extensively studied and successfully applied in finance [17] and economic
forecasting [13, 3, 15]. We are interested in Bayesian networks as they have some
advantages over neural networks, such as the ability to combine expert opinion
and experimental data [10], which is possibly the reason why Bayesian networks have
been successfully applied in many areas such as medical diagnosis, equipment
failure identification, and information retrieval.
The database included raw variables and derived variables, for example, ratios and
first differences obtained from the original variables, yielding in the end
117 different attribute variables.
Name Description
Economic activity
PIMBR Industrial production index based on the monthly industrial physical production
PIMBR_DZ Industrial production index based on the monthly industrial physical production
seasonally adjusted
Fiscal balance
DLSPTOT_PIB Public sector net debt (% of GDP)
NFSP_TOT_NOM Total nominal fiscal deficit of the public sector (% of GDP)
NFSP_TOT_PRIM Primary result of the public sector (% of GDP)
Monetary variables
SELIC_12M Annual Interest Rate (SELIC) (% year)
As our purpose was to predict the trend of EMBI+, we took special care in
organizing the dataset so as to guarantee that the data used for prediction was
exactly the information that the decision maker would have in a real situation.
Figure 2 shows the EMBI+ time series. The period between June and December
of 2002 is especially volatile because of political factors (election period).
2. Data cleaning. We cleaned the data (using a Perl script) by removing
samples with missing values, obtaining in the end 1483 instances (note that
we could have used all the data by handling the missing values, for example with the EM
algorithm [2]).
3. Filtering and discretization. The exploratory data analysis showed
high daily volatility in EMBI+, so the target classes for the classification process
were created from a nominal discretization over a filtered EMBI+ series. The
main reason for using the filtered dataset instead of the raw values is based on
our intuition that for corporate investment and policymakers' decisions it would
Fig. 2. EMBI+ series for the period January/1999 to December/2005.
Fig. 3. EMBI+ series for the period June-December 2002 and its filtered series with
smoothing parameter λ = 100 (left side) and λ = 1000 (right side).
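The filtered series in Figure 3 are labelled EMBIBR_HP and the paper cites Hodrick and Prescott [11], so the smoothing step is presumably a Hodrick-Prescott filter. A minimal sketch under that assumption, using the statsmodels implementation (the file and column names are hypothetical):

# Sketch: smoothing the daily EMBI+ series with a Hodrick-Prescott filter,
# assuming this is the filter behind the EMBIBR_HP series (cf. [11]).
# The CSV file and column names are hypothetical.
import pandas as pd
from statsmodels.tsa.filters.hp_filter import hpfilter

embi = pd.read_csv("embi_brazil.csv", parse_dates=["date"], index_col="date")["embi"]

# hpfilter returns (cycle, trend); the trend is the smoothed series.
_, embi_hp_100 = hpfilter(embi, lamb=100)    # lighter smoothing (left panel)
_, embi_hp_1000 = hpfilter(embi, lamb=1000)  # heavier smoothing (right panel)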
Table 3. Standard deviation of EMBIHP(t+i) for different smoothing parameters and leading
times i.
That is, if we consider EMBIHP = 1000 and Interval 1, the category stable corre-
sponds to values between 997 and 1003. For i = 20 (prediction for the next month), the
standard deviation of EMBIHP(t+20) is sd = 12.3%; then the category stable
corresponds to values between 877 and 1123.
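Based on this description, the three-category discretization can be sketched as follows; the function name, its signature and the symmetric one-standard-deviation band are assumptions for illustration:

# Sketch of the 3-class discretization of the filtered EMBI+ trend: "stable"
# if the future filtered value stays within +/- sd of the current value
# (in relative terms), otherwise "up" or "down". Names are illustrative.
def discretize_trend(current, future, sd):
    band = sd * current              # e.g. sd = 0.3% gives 997..1003 for current = 1000
    if future > current + band:
        return "up"
    if future < current - band:
        return "down"
    return "stable"

# Example from the text: for i = 20 days ahead, sd = 12.3%, so with current = 1000
# the "stable" band is roughly 877..1123.
print(discretize_trend(1000, 1100, sd=0.123))   # -> "stable"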
4. Training and testing. After the previous preprocessing stages, we used the
data set D to train a Bayesian network classifier. We used the free software WEKA
[2] to train and test D, including the continuous data discretization. The dataset D
was divided into training and testing data. Fayyad and Irani's supervised discretization
method [9] was applied to the training data and the same discretization rule was used
for testing data. We used 10-fold cross-validation [12] as evaluation criterion.
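The authors perform this step in WEKA; the sketch below is only a rough Python analogue of the same pipeline, with a generic quantile discretizer standing in for Fayyad-Irani supervised discretization and naive Bayes standing in for the Bayesian network classifiers. All names and parameters are assumptions for illustration:

# Rough analogue of the training/testing step (the paper itself uses WEKA):
# discretize the continuous attributes, train a naive Bayes classifier and
# estimate the classification rate with 10-fold cross-validation.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, FunctionTransformer
from sklearn.naive_bayes import CategoricalNB
from sklearn.model_selection import cross_val_score

def classification_rate(X, y):
    model = make_pipeline(
        KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile"),
        FunctionTransformer(lambda Z: Z.astype(int)),   # bin indices as categories
        CategoricalNB(min_categories=5),
    )
    return cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()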
Table 4. Classification rate results for different methods. Class EMBIHP(t+i) is dis-
cretized into 3 nominal categories (as described in Table 1). CR(t + i) is the classification
rate for EMBIHP(t+i) with different leading times i. Accuracy of 1 standard deviation
and smoothing parameter λ = 1000.
Table 5. Classification rate results for different methods. Class EMBIHP(t+i) is dis-
cretized into 5 nominal categories (as described in Table 2). CR(t + i) is the classification
rate for EMBIHP(t+i) with different leading times i. Accuracy of 1 standard deviation
and smoothing parameter λ = 1000.
5 Conclusion
We have presented preliminary results on the use of Bayesian network classifiers for
rating country risk (measured by EMBI+). We have compared different methods (Ta-
bles 4 and 5), and examined the effect of various smoothing parameters (Table 6). We
get rather good forecasts of the EMBI+ trend (for Brazil), with up to 95% classification
rate using TAN and C4.5 classifiers. While C4.5 produces the best classification rates,
it has a somewhat lower explanatory capacity. Probabilistic models such as TAN and NB
instead produce a probability distribution. The TAN classifier outperforms NB, probably
because it relaxes independence assumptions. Using the Bayesian network provided by
the TAN classifier, one can also perform a sensitivity analysis so as to investigate causal
relationships between macroeconomic variables.
Acknowledgements
The first author was supported by CNPq. The second author was supported by FAPESP
(through grant 05/57451-8). This work was (partially) developed in collaboration with
CEMAP/EESP.
References
1. J. Andrade and V.K. Teles, An empirical model of the brazilian country risk -
an extension of the beta country risk model, Applied Economics, vol. 38, 2006,
pp. 1271-1278.
2. R. Bouckaert, Bayesian Network Classifiers in WEKA, Tech. Report 14/2004,
Computer Science Department, University of Waikato, New Zealand, 2004.
3. X. Chen, J.S. Racine, and N.R. Swanson, Semiparametric ARX Neural Network
Models with an Application to Forecasting Inflation, IEEE Transactions on Neural
Networks: Special Issue on Neural Networks in Financial Engineering 12 (2001),
no. 4, 674-684.
4. C. Chow and C. Liu, Approximating Discrete Probability Distributions with Depen-
dence Trees, IEEE Transactions on Information Theory 14 (1968), no. 3, 462-467.
5. E.C. Colla and J.S. Ide, Modelo empirico probabilistico de reconhecimento de
padroes para risco pais, Proceedings of XXXIV Encontro Nacional de Economia -
ANPEC, 2006 (submitted).
6. J.C. Cosset and J. Roy, The Determinant of Country Risk Ratings, Journal of
International Business Studies 22 (1991), no. 1, 135-142.
7. R. Duda and P. Hart, Pattern Classification and Scene Analysis, John Wiley and
Sons, New York, 1973.
8. C. Erb, C. Harvey, and T. Viskanta, Political Risk, Economic Risk and Financial
Risk, Financial Analysts Journal - FAJ 52 (1996), no. 6, 29-46.
9. U.M. Fayyad and K.B. Irani, Multi-interval Discretization of Continuous-valued
Attributes for Classification Learning, IJCAI, 1993, pp. 1022-1029.
10. D. Heckerman, A Tutorial on Learning with Bayesian Networks, Tech. Report
MSR-TR-95-06, Microsoft Research, Redmond, Washington, 1995.
11. R. Hodrick and E.C. Prescott, Postwar U.S. Business Cycles: An Empirical Inves-
tigation, Journal of Money, Credit and Banking 29 (1997), no. 1, 1-16.
12. R. Kohavi, A Study of Cross-validation and Bootstrap for Accuracy Estimation and
Model Selection, IJCAI, 1995, pp. 1137-1145.
13. J.E. Moody, Economic Forecasting: Challenges and Neural Network Solutions, Pro-
ceedings of the International Symposium on Artificial Neural Networks, Hsinchu,
Taiwan, 1995.
14. N. Friedman, D. Geiger and M. Goldszmidt, Bayesian Network Classifiers, Machine
Learning 29 (1997), 131-163.
15. E. Nakamura, Inflation Forecasting using a Neural Network, Economic Letters 86
(2005), no. 3, 373-378.
16. A.C. Shapiro, Currency Risk and Country Risk in International Banking, Journal
of Finance 40 (1985), no. 3, 881-891.
17. E. Turban, Neural Networks in Finance and Investment: Using artificial intelli-
gence to improve real-world performance, McGraw-Hill, 1995.
18. I. Witten and E. Frank, Data Mining: Practical machine learning tools and tech-
niques, 2 ed., Morgan Kaufmann, 2005.
Experiences from Automatic Summarization of
IMF Staff Reports
Shuhua Liu1 and Johnny Lindroos2
1
Academy of Finland and IAMSR, Åbo Akademi University
2
IAMSR, Åbo Akademi University
Lemminkäisenkatu 14 B, 20520 Turku, Finland
[email protected], [email protected]
Abstract: So far, the majority of the studies in text summarization have focused on develop-
ing generic methods and techniques for summarization that are often designed and heavily
evaluated on news texts. However, little has been reported on how these generic systems
will perform on text genres that will be rather different from news. In this study, we report
our experience with applying the MEAD system, a text summarization toolkit developed at
the University of Michigan, to the summarization of IMF staff reports, which represent an
inherently very different type of document compared to news texts.
1 Introduction
Automated text summarization tools and systems are highly anticipated information process-
ing instruments in the business world, in government agencies or in the everyday life of the
general public. While human beings have proven to be extremely capable summarizers,
computer-based automated abstracting and summarizing has proven to be an extremely chal-
lenging task.
Since the 1990s there have been very active research efforts on exploring a variety of text
summarization methods and techniques such as statistical sentence scoring methods, dis-
course analysis methods based on rhetorical structure theory, and the use of lexical and on-
tology resources such as WordNet as an aid for improvements of other methods (Mani and
Maybury, 1999). Great progress has been made (Mani and Maybury, 1999; Moens and
Szpakowicz, 2004) and a number of rather impressive text summarization systems have
appeared such as the MEAD system from University of Michigan (Radev et al, 2003; 2004),
the SUMMARIST system from University of Southern California (Hovy and Lin, 1998),
and the Newsblaster from Columbia University, all of which employ a collection of summarization
techniques.
The rapid progress in the development of automated text summarization methods
and related fields also opens up much new ground for further investigation. So far, the
majority of the studies in text summarization have focused on developing generic methods
and techniques for summarization that are often designed and heavily evaluated on news
texts (DUC 2001-2005, http://duc.nist.gov/). However, little has been reported on how these
generic systems will perform on text genres that will be rather different from news. In this
study, we report our experience with applying the MEAD system, a text summarization
toolkit developed at the University of Michigan, to the summarization of IMF staff reports,
which are inherently very different from news texts in terms of the substance, length and
writing style. Four sets of summarization experiments are carried out and the system sum-
maries are evaluated according to their similarity to the corresponding staff-written Execu-
tive Summary included in the original reports.
The IMF Staff Reports are an important source of information concerning macroeco-
nomic development and policy issues for the member countries of the IMF. They are written
by IMF mission teams (Fund economists) as the product of their missions and are carefully
reviewed through an elaborate process of review by relevant departments in the Fund
(Harper, 1998). Although different missions have their unique kinds of concerns, and the
corresponding staff reports will be addressing different policy issues, staff reports all present
economic and policy issues in a historical perspective from past to present and future. The
structure of staff reports also reflects a standard that includes such components: (i) General
Economic Setting (ii) Policy Discussions and (iii) Staff Appraisal. The first part often con-
tains the conclusion of the last mission and the economic developments since last mission
while also points to problems and questions that would be the focus of the current mission.
It will deliver an overall picture of the current economic situation, highlights the important
aspects and critical issues. It will also point out what are the mid-term and long-term trends,
and what are only the one-off events. The second part is a report on discussions held to-
gether with the authorities about monetary and exchange rate policy, fiscal policy and oth-
ers. The focus will be on what the authorities perceived as the major obstacles in achieving
their mid-term objectives, and elaborations on policy alternatives. The third part is perhaps
the most important of all. It presents the mission team's policy recommendations for the mem-
ber country, advising on the policy strategy that will ensure the joint goals of
long-term growth and balance of payments stability (Harper, 1998).
MEAD is a public domain multi-document summarization system developed at the
CLAIR group led by Prof. Dragomir Radev at University of Michigan. MEAD offers a
number of summarization (or in fact sentence extraction) methods such as position-based,
query-based, centroid, and most recently LexPageRank, plus two baselines: random
and lead-based methods. In addition to the summarization methods, the MEAD package also
includes tools for evaluating summaries (MEAD Eval). MEAD Eval supports two
classes of intrinsic evaluation metrics: co-selection based metrics (precision, recall, relative
utility, kappa) and content-based metrics (cosine that uses TF*IDF, simple cosine that does
not use TF*IDF, and unigram- and bigram-overlap) (Radev et al, 2003, MEAD Documentation
v3.08). The MEAD system is freely available and can be downloaded from the research
group's website (http://www.summarization.com/mead/).
The overall architecture of the MEAD system consists of five types of processing functions:
Preprocessing, Features Scripts, Classifiers, Re-rankers and Evaluators (Radev et al, 2003).
Preprocessing takes as input the documents to be summarized in text or HTML format, iden-
tifies sentence boundaries and transforms them into an XML representation of the original
documents. Then, a set of features is extracted for each sentence to support the applica-
tion of different summarization methods such as position-based, centroid-based or query-
based sentence extraction methods (Radev et al, 2003). Following feature calculation, a
classifier is used to compute a composite score for each sentence. The composite score is
based on a weighted combination of the sentence features in a way specified by the classifier,
which can potentially refer to any features that a sentence has. After the classifier, each sen-
tence has been assigned a significance score. A re-ranker will then modify the sentence
scores by considering possible cross-sentence dependencies, source preferences, and so on.
Finally, MEAD Eval offers the instruments for evaluating summaries in pairs in terms of
their lexical similarity.
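As a rough sketch of how such a classifier turns per-sentence features into a composite score (the feature names and equal weights mirror the default configuration mentioned later; the actual feature values would come from MEAD's feature scripts), consider:

# Sketch: composite sentence score as a weighted combination of precomputed
# features, in the spirit of the MEAD default classifier where Centroid,
# Position and Length are weighted equally. Names are illustrative.
def composite_score(features, weights=None):
    # features: e.g. {"Centroid": 0.7, "Position": 1.0, "Length": 0.4}
    weights = weights or {"Centroid": 1.0, "Position": 1.0, "Length": 1.0}
    return sum(weights.get(name, 0.0) * value for name, value in features.items())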
2 Summarizing IMF Staff Reports Using MEAD
Our study started with summarizing five IMF staff reports: the Article IV Consultation re-
ports for China, Finland, Sweden and Norway from the years 2004-2005. The experi-
ments are further expanded to include 30 staff reports. All the documents are downloaded
directly from the IMF publication database accessible via the IMF website
(http://www.imf.org). In this paper we present our results from a series of further experi-
ments in order to evaluate the effects of different summarization methods and to find a
good summarization scheme for IMF staff reports.
The centroid-based method builds a "centroid" pseudo-sentence of the document, and the
significance of each sentence is determined by calculating how similar each sentence is to
this "centroid" sentence.
MEAD also includes two baseline summarizers: random and lead-based. The random
method simply puts together sentences randomly selected from a document as a summary. It
assigns a random value between 0 and 1 to each sentence. The lead-based method selects
the first sentence of each document, then the second sentence of each document, until the
desired size of the summary is reached. It assigns a score of 1/n to each sentence, where n is
the number of sentences in the document (Radev et al, 2003).
The MEAD re-ranker orders the sentences by score from highest to lowest and then iteratively
decides whether to add each sentence to the summary or not. At each step, if the quota of
words or sentences has not been filled, and the sentence is not too similar to any higher scor-
ing sentence already selected for the summary, the sentence in question is added to the
summary. The remaining sentences are discarded. The MEAD Cosine re-ranker simply discards
sentences above a certain similarity threshold. The MMR re-ranker, on the other hand, adjusts
sentence scores so that similar sentences end up receiving a lower score.
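A minimal sketch of an MMR-style re-ranker in this spirit is shown below; the pairwise similarity function, the trade-off formula and the parameter value are illustrative assumptions rather than MEAD's exact implementation (lam=0.2 mirrors the Lambda value used later in the experiments):

# Sketch of an MMR-style re-ranker: greedily pick the sentence with the best
# trade-off between its own score and its similarity to sentences already
# selected. `similarity` is any pairwise similarity function, e.g. cosine
# over TF*IDF vectors.
def mmr_rerank(sentences, scores, similarity, quota, lam=0.2):
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < quota:
        def mmr(i):
            redundancy = max((similarity(sentences[i], sentences[j]) for j in selected),
                             default=0.0)
            return lam * scores[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]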
The MEAD default classifier weights three methods: centroid, position and length, as
equally important features/methods for scoring sentences. In the experiments presented
here, we basically adopted two summarization strategies: one summarizes a staff report as a
whole; the other summarizes a staff report broken into multiple parts according to its natural
sections. Since the sections that appear at a later part of the report (e.g. policy discussion,
staff appraisal) are as important as (if not more important than) the sections that appear
earlier, the second strategy hopefully helps to achieve a balance of content in the output
summary. In all cases the MMR re-ranker (with Lambda set to 0.2) is used instead of MEAD
Cosine, as we found that it has better control over redundancy (Liu, 2006). In summarizing a
report as a whole, only one classifier, the Centroid method, is applied. In summarizing
section by section, two classifiers are tested: (i) Centroid, (ii) Centroid + Position, where the
position-based method assigns extra weight to sentences that are at the beginning of a
document. In addition, baselines are created using the Lead-based and Random methods.
Classifier : bin/default-classifier.pl
Centroid 1.0
Reranker : bin/default-reranker.pl
MMR 0.2 enidf
Compression: 5% of sentences
Classifier : bin/default-classifier.pl
Centroid 1.0 Position 1.0
Reranker : bin/default-reranker.pl
MMR 0.2 enidf
Compression: 5% of sentences
and the extrinsic evaluation methods measure how well a summary helps in performing a
specific task. Intrinsic evaluation, on the other hand, judges the quality of a summary by
comparing it to some model summaries (e.g. a manually-written summary).
In our study we are interested first in intrinsic evaluation of the system generated sum-
maries by comparing them to the staff-written executive summary of a report. Given the
assumption that the executive summary is an ideal summary, the optimal result of
MEAD would be a summary identical to the executive summary. However, as the MEAD
summary is an extract and the executive summary is an abstract, there will be some inevita-
ble differences between the two summaries. The differences between two texts can be meas-
ured using text similarity measurements. Two different similarity measurements are applied:
semantic similarity and lexical similarity. Lexical similarity measures the similarity of the
actual words used, and does not concern itself with the meaning of the words. On the other
hand, semantic similarity attempts to measure similarity in terms of meaning.
While the first and foremost goal is to produce summaries that are semantically similar
to the original document, lexical similarities may potentially provide a valuable similarity
measurement. In our case, the original report and its executive summary have been written
by the same authors, so similar words and expressions tend to be used to denote the same con-
cepts; it is therefore possible that the lexical differences between the executive summary and the
MEAD-produced summary are not as outstanding as they would have been if the executive
summary and original document had been written by unrelated authors.
MEAD provides two types of evaluation tools for evaluating summaries. One is based
on co-selection metrics and is used to compare two summaries that share identical sen-
tences, i.e. extractive summaries. This kind of evaluation tool is of no interest in this ex-
periment, as it is only usable in a system-to-system evaluation environment. The second
and in this case more applicable tools are based on content-based metrics, which can be used
to measure lexical similarities of two arbitrary summaries, i.e. they are not limited to evalu-
ating summaries generated by MEAD (Radev et al. 2004b). These tools apply various word
overlap algorithms (Cosine, Simple cosine, as well as unigram and bi-gram co-occurrence
statistics) to measure the similarity of two texts (Radev et al. 2004b; Papineni et al. 2002):
Simple cosine: cosine similarity measurement of word overlap; calculates the cosine
similarity with a simple binary count.
Cosine: Cosine similarity measurement of weighted word overlap; weights are de-
fined by TF*IDF values of words.
Token overlap: single word overlap measurement.
Bigram overlap: bigram overlap measurement.
Normalized LCS: measurement of the longest-common subsequence.
BLEU: a measurement method based on an n-gram model.
It can be noted that most of the overlap algorithms use relatively similar approaches, and
that the results from each algorithm seem to correlate with the results from the other overlap
algorithms (Lindroos, 2006).
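For illustration, the simpler of these content-based metrics can be computed directly; the sketch below is not MEAD Eval's implementation but follows the same definitions (the whitespace tokenization and the Jaccard form of token overlap are assumptions):

# Sketch of three content-based metrics: simple cosine (binary word counts),
# TF*IDF-weighted cosine, and token overlap. Tokenization is deliberately naive.
import math
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tokens(text):
    return text.lower().split()

def simple_cosine(a, b):
    ta, tb = set(tokens(a)), set(tokens(b))
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))

def token_overlap(a, b):
    ta, tb = set(tokens(a)), set(tokens(b))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def tfidf_cosine(a, b):
    tfidf = TfidfVectorizer().fit_transform([a, b])
    return cosine_similarity(tfidf[0], tfidf[1])[0, 0]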
The other type of similarity that will be evaluated in our experiments is semantic similarity.
One approach that can be used for such evaluation is Latent Semantic Analysis (LSA),
which is designed to capture similarities in terms of meaning between two texts. LSA
makes it possible to approximate human judgments of meaning similarity between words.
However, for LSA to function properly, the words in the texts being evaluated must be
represented in the semantic space. If they are not represented, the result will be an inaccu-
rate estimation of the similarity of the two texts. LSA has been shown to be remarkably
effective given the right semantic space. LSA has been found to be more accurate than lexi-
cal methods when trained on documents related to the text being evaluated. However, lexi-
cal methods have outperformed LSA when LSA has been poorly trained, i.e. trained on un-
related or too general documents in comparison to the texts being evaluated (Landauer et al.
1998; Peck et al. 2004).
Making use of the semantic space and similarity analysis tools at the University of Colo-
rado at Boulder (Latent Semantic Analysis @ CU Boulder, http://lsa.colorado.edu/), we
carried out one-to-many, document-to-document comparisons between the different sum-
mary outputs for each staff report and the corresponding executive summary. This gave us
indications of the semantic similarity between an executive summary and each of the sys-
tem summaries for the same report.
As mentioned earlier, using the right semantic space is an important part of applying the
LSA tool. The LSA tool available at the University of Colorado at Boulder website does not
have a semantic space for texts related to economics, or other types of text that would be
similar to IMF staff reports. As a result, the "General Reading up to 1st year college" seman-
tic space is used in the experiments in this study. Consequently, it should be noted that
LSA may not perform as well as it might have performed provided a more specialized se-
mantic space had been available.
3 Results
In this section we present results from four sets of summarization experiments.
3.1 Pre-Experiment
Classifier : bin/default-classifier.pl
Centroid 1.0
Reranker : bin/default-reranker.pl
MMR 0.2 enidf
Compression: 5% of sentences
Summaries are generated separately for each of the staff reports. However the input text
contains not only the staff report, but also text boxes and the supplements to it. The system
output thus represents a summary of multiple documents. The summary outputs were examined
by comparing them to the content of the original reports.
The China report: At a compression rate of 5%, the system summary contains 29 full
sentences: 12 sentences are from the first section (Recent Economic Development and Out-
look); 5 sentences from the second section (Policy Discussion), with two of them from text
boxes which are there to provide more historical and background information. Only one
sentence is selected from section III, the Staff Appraisal. And 11 sentences are picked up
from the Public Information Notice part. As such, the system summary seems to be missing
out much of the most important information of a staff report, that is, the policy discussions
as well as analysis and recommendations from the mission team. The reason may be that,
since Section I and the Public Information Notice part both have a large portion of text
describing recent economic development and outlook, the system captures this as the
centroid of the report, with the help of the Position method, which adds more weight to
content from Section I.
Another observation is that the selected sentences tend to fall short in terms of the topic
coverage. The sentences selected from Sections I and IV seem to be concentrated on only
one or two issues or aspects of the economy, often with redundancy, while neglecting other
aspects.
The Finland report: The Finland report is relatively short and the 5% system summary
is an extract of altogether 16 sentences, of which ten sentences fall into Section I (Eco-
nomic Background), two sentences come from the Public Information Notice and four sentences
from Policy Discussions, with no sentences from Policy Setting and Outlook, Staff
Appraisal and Statement by the Executive Director for Finland. As with the
China report, the system summary is still unbalanced with regard to the origins of the ex-
tracted sentences. The system summary seems to have captured the core subjects, i.e. popu-
lation aging and fiscal policy. It also captures the positive flavor of the original report. The
difference from the case of the China report is that it seems to be a bit better in terms of the
coverage of different aspects of the Finnish economy. The substance covered in the ex-
tracted parts touches upon more diverse aspects and policy issues.
The Sweden 2004 report: The system summary includes 19 sentences: eight from Eco-
nomic Background; two from Policy Setting and Short-Term Outlook; four from Policy
Discussions, four from Public Information Notice (Board Assessment), none from Staff
Appraisal, perhaps because these two sections concern similar topics, and the sentences in
Policy discussions are longer than those in the Appraisal part. There is a very long sentence
in Policy Discussion that is selected. The sentence is in fact a composition of listed items
that were not properly segmented because they do not end with a period, although the result
is not bad in terms of summary quality because it captures the list of issues of concern for policy dis-
cussion. The Sweden report summary seems to be a bit more balanced in terms of sentence
origins than the China and Finland reports. This long sentence that touches upon all the im-
portant issues may have played a role.
The Sweden 2005 report: The Sweden 2005 report is identical to the Sweden 2004 report
in terms of length, format and content structure. The system output contains 18 sentences:
five from Economic Background, five from Policy Setting and Short-Term Outlook, three
from Policy Discussions, four from Public Information Notice, and none from Staff Ap-
praisal, basically also identical to the system summary for the 2004 report in terms of the
origins of the selected sentences. However, the result also reflects major differences in their
substance, with its focus being more on fiscal issues, overall economic development, etc.
The Norway report: The system summary includes altogether 16 sentences. Among
these sentences, ten are from the first section Economic Background, two from Policy Dis-
cussions, one from Staff Appraisal, and three from the Public Information Notice, all cen-
tered around fiscal policy related to the non-oil deficit, oil and gas revenues, the GPF
(Norway Government Petroleum Fund), and strong economic performance. The selected sen-
tences are unbalanced in an unfavorable way due to the effect of the use of the Position fea-
ture, i.e. sentences appearing at the beginning and early part of the report get significantly
more weight than those in the Policy Discussion and Staff Appraisal parts. The human writ-
ten executive summary has a much wider coverage.
Overall, the system has performed consistently across the reports for different countries. The
results reflect the higher weight of the parts Economic Background and Public Informa-
tion Notice, but tend to overlook the parts Policy Discussion and Staff Appraisal. The
default classifier weights the centroid, position and length as equally important. The result
of such a summarizer on the staff reports is that it tends to pick up sentences that appear at the very
beginning of the report as well as the earlier parts of the report. The default re-ranker favors
redundancy over topic coverage, which is why the results from this set of experiments show
much redundancy in the extracted sentences.
3.2 Experiment 1
Learning from the pre-experiment, in this and the following experiments we devoted more
pre-processing work to the input texts before summarizing them. All additional items in the
staff reports are removed before input to the summarization system. In addition, customized
summarizers are applied. Specifically, in this set of experiments, each of the five staff reports
is summarized as a whole applying the Centroid method and the MMR re-ranker, with two compres-
sion rates, 5% and 10% of sentences. In addition, two baselines are generated for each re-
port, one lead-based and the other Random-based, both at a compression rate of 5% of sentences. In
total 18 summary outputs are produced.
Judging by the content in comparison to the original report, the result is significantly
different from the pre-experiment. The 5% summary contains significantly more sentences
from the policy discussion and staff appraisal than in the last experiment, while also having a
much wider topic coverage. The MMR re-ranker seems to work very well. There is much
less redundancy than in the last experiment. However, the complete removal of the Position
feature results in missing some important sentences that often appear at the begin-
ning of different sections.
3.3 Experiment 2
Next, the China report is broken into three parts by its natural sections and then summarized
as one cluster. Two summarizers described earlier (Centroid vs Centroid+Position) are ap-
plied. Again the experiments are repeated with two compression rates, 5% and 10% of sen-
tences. Four summaries are created.
At 5% compression rate, the Centroid method extracted 6 sentences from Part I (Eco-
nomic Background), 11 sentences from Policy Discussions and 5 sentences from Staff Ap-
praisal. The specific summarization process guarantees a good balance of the origin of the
selected sentences. With a compression rate of 10%, although the length of the summary
doubles (while always including all the sentences from the 5% compression rate experiment),
there is not much redundancy in the result due to the effect of the MMR re-ranker.
All the system generated summaries from Experiments 1 and 2 are evaluated by examin-
ing their semantic similarity to the model summary, the staff-written executive summary,
using LSA analysis. The similarity measure can be, e.g., the cosine similarity of document vec-
tors in the semantic space. The best available semantic space seems to be the Gen-
eral_Reading_up_to_1st_year_college. The results are shown in Table 1 below. Although
this does not give us the precise similarity values, it gives us an indication of the relative
performance of the different summarization schemes. (Note: The baselines for the China report
were only created in the experiments of 10% sentence compression rate, so they are not
shown here. Nonetheless, judging from the 10% result, there is not much surprise as com-
pared with other reports).
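The LSA comparison itself was done through the CU Boulder web tool; as a rough local stand-in (not the tool used in the study), one can learn a latent space from a background corpus and compare documents by cosine similarity in that space. The background corpus and number of dimensions below are assumptions:

# Rough local approximation of the LSA comparison: learn a latent semantic
# space from a background corpus with truncated SVD over TF*IDF, then compare
# two documents by cosine similarity of their vectors in that space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.metrics.pairwise import cosine_similarity

def lsa_similarity(background_corpus, executive_summary, system_summary, dims=100):
    lsa = make_pipeline(TfidfVectorizer(stop_words="english"),
                        TruncatedSVD(n_components=dims))
    lsa.fit(background_corpus)                      # the "semantic space"
    vecs = lsa.transform([executive_summary, system_summary])
    return cosine_similarity(vecs[:1], vecs[1:])[0, 0]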
Table 1 Comparing System Summaries with Executive Summaries
using Latent Semantic Analysis
(all with MMR re-ranker)
China04 Fin04 Swed.04 Swed.05 Nor.04
5%C 0.79 0.85 0.91 0.84 0.83
10%C 0.80 0.86 0.91 0.88 0.86
5%Lead - 0.84 0.79 0.75 0.75
5%Rand. - 0.79 0.89 0.77 0.78
5% Multi-part Centroid 0.78 - - - -
10% Multi-part Centroid 0.79 - - - -
5% Multi-part Centroid+Position 0.83 - - - -
10% Multi-part Centroid+Position 0.82 - - - -
The results seem to indicate that the performances of the different summarization ap-
proaches for the China report do not differ very much between the Centroid and Multi-part
Centroid methods, but the Multi-part Centroid + Position approach looks slightly better.
Overall, the compression rates of 5% and 10% do not make much difference to the evalua-
tion result, although judging from the content of the summaries, the latter delivers much
richer information than the former.
The same Centroid method performs rather consistently on the Finland report, Sweden
2005 report and Norway report, but differs a bit with the China report and Sweden 2004
report, maybe due to the slightly different ways the substance in the executive summary
was drawn up. The Centroid method is shown to perform better than the Lead-based and Ran-
dom methods in all cases, but only slightly. What is surprising is that the Random method
seems to outperform the Lead-based method, except in the case of the Finland report.
The selected semantic space turns out not to be the best choice, as the results are
accompanied by remarks about terms in the reports that do not exist in the selected corpus.
However, for the moment this seems to be the closest approximation to a semantic space for the
economics domain that is available on the site.
3.4 Experiment 3
In addition to the above evaluation, extensive summarization experiments are carried out on
30 IMF staff reports from the years 2003-2005, applying six different summarization con-
figurations to each report: (1) Centroid method with MEAD default reranker (C-D), (2) Cen-
troid+Position method with MEAD default reranker (CP-D), (3) Random method (Rand.);
(4) Lead method (Lead); (5) Centroid method with MMR reranker (C-MMR); and (6) Cen-
troid+Position method with MMR reranker (CP-MMR); all at a compression rate of 5%
sentences. In total, 180 summaries were produced, 6 summaries for each report. All the sys-
tem outputs are evaluated against the corresponding staff written Executive Summaries us-
ing five of the content based evaluation metrics: Simple cosine, Cosine (tf*idf), Token
Overlap, Bigram Overlap and BLEU.
The performance of the six different summarization configurations along the five differ-
ent lexical similarity measurements is given in the tables below (paired t-test: probability
associated with a Student's paired t-test, with a two-tailed distribution).
The results again confirm the correlation among the different overlapping measure-
ments. If a summarization scheme performs well on one similarity metric, it generally also
performs well in terms of the other four similarity metrics. Overall, it can be noted that,
among the four summarization schemes C-D, CP-D, C-MMR and CP-MMR, CP-D and CP-
MMR show somewhat better performance than C-D and C-MMR respectively. When
the summarization method is held unchanged (C or CP), the summarization schemes adopt-
ing the MEAD default re-ranker usually exhibit inferior performance compared to the
summarization schemes using the MMR re-ranker. The two baselines are sometimes the top per-
former and sometimes the worst performer. Their performance is very unstable, which is
also indicated by the highest standard deviations they usually have.
Cosine Similarity
Token Overlap
Bi-gram Overlap
BLEU
The paired t-tests (null hypothesis) show quite varied results for the different similarity meas-
urements. Judging by simple cosine similarity and token overlap, only the performance dif-
ference between the Random method and CP-D is statistically significant (p<0.05). All the
other differences are statistically insignificant. In terms of cosine similarity however, the
difference between Random method and CP-D as well as between Lead method and Ran-
dom method are statistically significant. Judging by bi-gram overlap, the performance dif-
ferences between the paired methods are all insignificant. Finally, using the BLEU metric,
the paired t-test indicates significant performance difference between C-D and CP-D, be-
tween Lead and Random method, between CP-MMR and C-MMR. On the contrary, the
performance difference between method Random and CP-D, between Lead and Random
method, and between C-MMR and Lead method are not statistically significant, which is
counter-intuitive. This seems to suggest that BLEU as a document similarity measurement
differs more from the other four measurements. Human judgment needs to be incorporated
to find out whether this may suggest that BLEU is a less appropriate evaluation metric than
others, and which ones are the more appropriate summary evaluation metrics.
4 Conclusions
In this paper we report our experience with applying the MEAD system to summarize the
IMF staff reports. Compared to news articles, which are usually simpler in format and much
shorter in length, summarizing the IMF staff reports makes it necessary to incorporate suitable
preprocessing and to take advantage of the flexibility in selecting and combining multiple
methods. Compared with the rich variety in writing style of news texts, the IMF staff re-
ports are much more predictable due to their very standardized format for structuring and
presenting the content. This makes it easier to see the effects of different summarization
schemes.
Achieving good summary results requires a good understanding of the
summarization methods in the system. However, finding the best combinations of meth-
ods requires non-trivial effort. Also, one insight gained from the experiments is that the
IMF staff reports are documents that contain what may be called multiple centroids or multiple
subjects that are equally important, while the MEAD Centroid-based method is supposed to
capture only one centroid.
Compression rate naturally should have a big impact on the output summaries. The LSA
based similarity analysis of the system summary and the corresponding executive summary,
however, shows basically little difference between the 5% and 10% outputs. This may indicate
that the 5% compression rate is a rather good choice for summarizing staff reports if output
length is a critical measurement. However, what should also be noted is that at 10% com-
pression rate the system outputs deliver much more complete content than at a 5% compres-
sion rate. It is very hard for machine-generated summaries to be wide in coverage and at the
same time short in length. The system generated summary will usually be considerably
lengthy if it is expected to be as informative as the Executive Summary. The results also
revealed that depending solely on the current LSA tool to evaluate the summarization results
based on similarity analysis is not enough.
Further evaluations have been carried out using a number of content-based evaluation
metrics. The results confirm the correlation among the different overlapping measurements.
In addition, they show that the CP-D and CP-MMR schemes achieve somewhat better perform-
ance than C-D and C-MMR respectively. When the summarization method is held un-
changed (C or CP), the summarization schemes adopting the MMR re-ranker exhibit better per-
formance compared to the summarization schemes using the MEAD default re-ranker. The
two baselines are sometimes the top performer and sometimes the worst performer. Their
performances tend to be unstable. To find the best summarization scheme and best evalua-
tion metrics for summarizing IMF staff reports, expert evaluation of summary quality will
be incorporated in our future research.
Finally, the IMF staff reports have evident content structuring characteristics that could
be utilized to benefit the summarization output. Each section of a staff report is divided into
a set of numbered paragraphs. Each paragraph appears to start with a sentence describing the
most essential information that is discussed or can be derived from that paragraph, with the
rest of the paragraph aimed towards supporting the information presented in the initial sen-
tence. This would not only indicate that the original document contains sentences suitable to
be extracted for a summary, but also make it a rational approach to simply extract all such
introductory sentences to produce a summary of reasonable quality. This approach is further
explored in other experiments.
Acknowledgments Financial support from Academy of Finland is gratefully acknowledged.
References
[1] Mani I. and Maybury M. T. (Eds.) (1999). Advances in Automatic Text Summarization, MIT Press.
Cambridge, MA.
[2] Moens M. and Szpakowicz S. (Eds.) (2004). Text Summarization Branches Out, Proceedings of the
ACL-05 Workshop, July 25-26 2004, Barcelona, Spain
[3] Radev D., Allison T., Blair-Goldensohn S., Blitzer J., Celebi A., Drabek E., Lam W., Liu D., Qi H.,
Saggion H., Teufel S., Topper M. and Winkel A. (2003). The MEAD Multidocument Summarizer.
MEAD Documentation v3.08. available at http://www.summarization.com/mead/.
[4] Radev D., Jing H., Stys M. and Tam D. (2004). Centroid-based summarization of multiple docu-
ments. Information Processing and Management, 40, 919-938.
[5] Hovy E. and Lin C. (1999). Automated Text Summarization in SUMMARIST. Mani and Maybury
(eds), Advances in Automatic Text Summarization, MIT Press, 1999.
[6] Harper, R. (1998). Inside the IMF, 1st Edition, Academic Press.
[7] Carbonell J. G. and J. Goldstein. The Use of MMR, Diversity-Based Reranking for Reordering
Documents and Producing Summaries. In Alistair Moffat and Justin Zobel, editors, Proceedings of the
21st Annual International ACM SIGIR Conference on Research and Development in Information Re-
trieval, pages 335-336, Melbourne, Australia, 1998.
[8] Liu S. and J. Lindroos, "Automated Text Summarization using MEAD: Experience with the IMF
Staff Reports", to be presented at IIIA2006 -International Workshop on Intelligent Information Access,
July 6-8, 2006, Helsinki.
[9] Lindroos J. Automated Text Summarization using MEAD: Experience with the IMF Staff Reports,
Master's Thesis, Åbo Akademi University, 2006.
Tapping into the European Energy Exchange
(www.EEX.de) to feed a Stochastic Portfolio
Optimization Tool SpOt for Electric Utilities
1 Introduction
The wealth of consumers around the globe depends on affordable electric energy.
With energy prices in general having been on a roller-coaster ride within the last four
years, electric utilities need to find new ways to provide energy at low prices and low
risk.
What is the best mix of energy assets to hold for the risk we are willing to absorb
given our customers' demand profiles? What are the key asset decisions we should
make and when should we make them? What options should we exercise? What long-
term contracts should we negotiate? All these questions relate to changes in an
electric utility's asset portfolio. Within the electric utility industry, portfolio
optimization is the process of implementing strategy and information technology
solutions to maximize the value and manage the risk of integrated energy portfolios
over the near-, medium-, and long-term. While conventional portfolio management
solutions rely on heuristics and/or cumbersome trial-and-error iterations, only a
stochastic portfolio optimization model can provide full transparency and
automatically find the answers to the questions listed above.
For this purpose the Fraunhofer Institut für Umwelt-, Sicherheits- und
Energietechnik UMSICHT developed in cooperation with the companies Capgemini
Deutschland GmbH and SAS Institute GmbH the Stochastic Portfolio Optimization
Tool SpOt. Robust mathematical approaches including jump diffusion and GARCH
models as well as stochastic portfolio optimization techniques are applied to develop
a dynamic portfolio management solution. Our approaches focus on delivering
flexible portfolio management strategies that adapt to changing markets and
regulatory parameters (for example carbon dioxide certificates). The solution SpOt
can be applied in the following four areas: 1) Asset optimization, providing support for
a wide set of problems, ranging from long-term investment decisions (e.g., the
upgrade of a power plant or the decision to install pollution controls) to medium-term
strategy support (e.g., implementing the right mix of contracts and options). 2) Asset
deployment, characterizing the short-term (e.g., daily) operational decisions on assets.
Examples include traders' daily nomination decisions on purchases to meet
generation demand. 3) Asset valuation, which determines the value of a tangible or intangible
asset (e.g., a tangible power plant or intangible emissions allowances). 4)
Company-wide risk controlling and overall portfolio management.
Since history never repeats itself, reliance on constant historical price volatility
estimates and historical price curves will lead to sub-optimal performance. Our
approach therefore estimates a GARCH model to determine the underlying
characteristic parameters of an endogenous price volatility process. User-specified
price forward curves that reflect fundamental expectations of future price levels are
used as a basis to simulate thousands of price trajectories around the expected future
price by means of the mean-reverting GARCH model. Mean reversion is a tendency
for a stochastic process to remain near, or tend to return over time to a long-run
average level. In this way we are able to generate a large number of forward-looking
price trajectories to incorporate many alternate views on future price levels and price
volatilities into an integrated valuation and decision-support framework for a
subsequent holistic stochastic optimization.
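As an illustration of this simulation step, the following minimal Python sketch generates mean-reverting log-price trajectories around a user-specified forward curve with GARCH(1,1) conditional volatility. All parameter values, the flat forward curve and the function name are illustrative placeholders, not the calibrated EEX figures or the actual SpOt implementation.

import numpy as np

def simulate_price_paths(forward_curve, n_paths=1000, kappa=0.02,
                         omega=1e-5, alpha=0.08, beta=0.90, seed=0):
    """Simulate mean-reverting log-price trajectories around a user-specified
    forward curve, with conditional volatility driven by a GARCH(1,1) process.
    All parameters are illustrative; in practice they would be estimated from
    historical EEX prices."""
    rng = np.random.default_rng(seed)
    n_steps = len(forward_curve)
    log_fwd = np.log(forward_curve)
    paths = np.empty((n_paths, n_steps))
    x = np.full(n_paths, log_fwd[0])                   # current log prices
    h = np.full(n_paths, omega / (1 - alpha - beta))   # long-run variance
    eps_prev = np.zeros(n_paths)
    for t in range(n_steps):
        # GARCH(1,1) update of the conditional variance
        h = omega + alpha * eps_prev**2 + beta * h
        eps = rng.normal(0.0, np.sqrt(h))
        # mean reversion towards the forward curve plus the GARCH shock
        x = x + kappa * (log_fwd[t] - x) + eps
        paths[:, t] = np.exp(x)
        eps_prev = eps
    return paths

# usage: a flat 60 EUR/MWh forward curve over 365 days
curve = np.full(365, 60.0)
trajectories = simulate_price_paths(curve, n_paths=2000)
print(trajectories.shape)   # (2000, 365)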
3 Stochastic Optimization
Stochastic optimization addresses decision problems involving unknown parameters. Stochastic optimization models take advantage of the fact that
probability distributions governing the data are known or can be estimated or
simulated. The goal here is to find some policy that is feasible for all simulated data
instances and maximizes the expectation of some utility function of the decisions and
the random variables. More generally, such models are formulated, solved, and
analyzed in order to provide useful information to a decision-maker in the real world
with uncertain information.
The key modelling feature in our objective function is a convex combination of
CVaR (risk) and average procurement cost over all price trajectories:

   minimize   a · (average procurement cost) + (1 − a) · CVaR,               (1)

where CVaR is the conditional value at risk and a is the user's risk preference. For a = 0 the
user wants to reduce risk at any cost, and for a = 1 the user prefers to reduce cost no
matter what the risk is.
SpOt solves problem (1) for different values of a in [0,1] and presents the user with
a portfolio efficiency frontier. For any energy portfolio under consideration, following
an imaginary horizontal line from the corresponding point until it hits the efficiency
frontier yields a better hedge with lower average cost at the same risk. Following
an imaginary vertical line until we hit the efficiency frontier instead, we can
determine a better hedge with lower risk at the same cost. Thus the resulting
portfolio adjustment transactions needed to move to the efficiency frontier are easily
determined.
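A small Python sketch of how such a frontier can be traced is given below. It assumes the simulated procurement costs of a finite set of candidate portfolios are already available as an array (here filled with random placeholder numbers); the real SpOt tool solves a full stochastic program rather than enumerating candidates, so this only illustrates objective (1) and the sweep over a.

import numpy as np

def cvar(costs, level=0.95):
    """Conditional value at risk: expected cost in the worst (1 - level) tail."""
    threshold = np.quantile(costs, level)
    return costs[costs >= threshold].mean()

def efficiency_frontier(cost_samples, alphas=np.linspace(0.0, 1.0, 11)):
    """cost_samples: array of shape (n_portfolios, n_trajectories) holding the
    simulated procurement cost of each candidate portfolio on each price path.
    For every risk preference a, the portfolio minimising
        a * mean cost + (1 - a) * CVaR
    is selected; the resulting (risk, cost) pairs trace the efficiency frontier."""
    mean_cost = cost_samples.mean(axis=1)
    risk = np.array([cvar(c) for c in cost_samples])
    frontier = []
    for a in alphas:
        objective = a * mean_cost + (1.0 - a) * risk
        best = int(np.argmin(objective))
        frontier.append((a, best, mean_cost[best], risk[best]))
    return frontier

# usage with random placeholder costs for 50 candidate portfolios
rng = np.random.default_rng(1)
samples = rng.normal(100, 15, size=(50, 2000)) + rng.gamma(2.0, 5.0, size=(50, 1))
for a, idx, cost, risk in efficiency_frontier(samples):
    print(f"a={a:.1f}  portfolio {idx:2d}  mean cost {cost:7.2f}  CVaR {risk:7.2f}")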
Users of SpOt get full transparency over their portfolio restructuring process and
are able to determine optimal portfolios that will better satisfy their risk preference. In
this way SpOt helps to obtain lower operating cost and risk. These savings provide
utilities with stronger competitiveness and will ultimately contribute to dampening
electric energy prices for consumers.
References
1. Spangardt, G.; Lucht, M.; Handschin, E.: Applications for Stochastic Optimization in the
Power Industry. Electrical Engineering (Archiv für Elektrotechnik), 2004, in press, online
DOI 10.1007/s00202-004-0273-z
57
2. Spangardt, G.; Lucht, M.; Althaus, W.: Optimisation of Physical and Financial Power
Purchase Portfolios. Central European Journal of Operations Research 11/2003, S. 335-350.
3. Spangardt, G.; Lucht, M.; Althaus, W.: Automatisierte Portfolio-Optimierung für Stadtwerke
und Industriekunden. VDI-Berichte 1792: Optimierung in der Energieversorgung, VDI
Verlag, Düsseldorf, 2003, S. 13-22.
4. Spangardt, G; Lucht, M.; Althaus, W.; Handschin, E.: Vom Portfoliomanagement zur
Portfolio-Optimierung. Marktplatz Energie, April 2003, S. 6-7.
5. Spangardt, G.: Mittelfristige risikoorientierte Optimierung von Strombezugsportfolios.
Dissertation, Fakultät für Elektrotechnik und Informationstechnik, Universität Dortmund,
UMSICHT-Schriftenreihe, Band 43, Fraunhofer IRB- Verlag, Stuttgart.
Mining Medical Administrative Data -
The PKB System
1 Introduction
Hospitals are driven by the twin constraints of maximising patient care while
minimising the costs of doing so. For public hospitals in particular, the overall
budget is generally fixed and thus the quantity (and quality) of the health care
provided is dependent on the patient mix and the costs of provision.
Some of the issues that hospitals have to handle are frequently related to
resource allocation. This requires decisions about how best to allocate those re-
sources, and an understanding of the impacts of those decisions. Often the impact
can be seen clearly (for example, increasing elective surgery puts more surgical
patients in beds, as a result constraining admissions from the emergency de-
partment leading to an increase in waiting times in the emergency department)
but the direct cause may not be apparent (is the cause simply more elective
patients? is it discharge practices? has the average length of stay changed? is
there a change in the casemix in emergency or elective patients? and so on).
Analysing the data with all these potential variables is difficult and time con-
suming. Focused analysis may come up with a result that explains the change
but an unfocussed analysis can be a fruitless and frustrating exercise.
As a part this resource pressure, hospitals are often unable to have teams of
analysts looking across all their data, searching for useful information such as
trends and anomalies. For example, typically the team charged with managing
the patient costing system, which incorporates a large data repository, is small.
These staff may not have strong statistical backgrounds or the time or tools
to undertake complex multi-dimensional analysis or data mining. Much of their
work is in presenting and analysing a set of standard reports, often related to
the financial signals that the hospital responds to (such as cost, revenue, length
of stay or casemix). Even with OLAP tools and report suites it is difficult for
the users to look at more than a small percentage of the available dimensions
(usually related to the known areas of interest) and to undertake some ad hoc
analysis in specific areas, often as a result of a targeted request, e.g. what are
the cost drivers for liver transplants?
Even disregarding the trauma of an adverse patient outcome, adverse events
can be expensive in that they increase the clinical intervention required, result-
ing in higher-than-average treatment costs and length-of-stay, and can also result
in expensive litigation. Unfortunately, adverse outcomes are not rare. A study
by Wolff et al. [1] focusing on rural hospitals estimated that 0.77% of patients
experienced an adverse event, while another by Ehsani et al., which included
metropolitan hospitals, estimated a figure of 6.88% [2]. The latter study states
that the "total cost of adverse events ... [represented] 15.7% of the total expenditure
on direct hospital costs, or an additional 18.6% of the total inpatient hospital
budget". Given these indicators, it is important that the usefulness of data mining
techniques in reducing artefacts such as adverse events is explored.
A seminal example of data mining use within the hospital domain occurred
during the Bristol Royal Infirmary inquiry of 2001 [3], in which data mining algo-
rithms were used to create hypotheses regarding the excessive number of deaths
among infants who underwent open-heart surgery at the Bristol Royal Infirmary. In a
recent speech, Sir Ian Kennedy (who led the original inquiry) said, with respect
to improving patient safety, that "The [current] picture is one of pockets of activ-
ity but poor overall coordination and limited analysis and dissemination of any
lessons. Every month that goes by in which bad, unsafe practice is not identified
and rooted out and good practice shared, is a month in which more patients die
or are harmed unnecessarily." The role of data mining within hospital analysis
is important given the complexity and scale of the analysis to be undertaken.
Data mining can provide solutions that can facilitate the benchmarking of pa-
tient safety provision, which will help eliminate variations in clinical practice,
thus improving patient safety.
The Power Knowledge Builder (PKB) project provides a suite of data min-
ing capabilities, tailored to this domain. The system aims to alert management
to events or items of interest in a timely manner either through automated ex-
ception reporting, or through explicit exploratory analysis. The initial suite of
algorithms (trend analysis, resource analysis, outlier detection and clustering)
were selected as forming the core set of tools that could be used to perform
data mining in a way that would be usable to educated users, but without the
requirement for sophisticated statistical knowledge.
To our knowledge, PKB's goal is unique: it is industry-specific and does
not require specialised data mining skills, but aims to leverage the data and
skills that hospitals already have in place. There are other current data mining
solutions, but they are typically part of more generic reporting solutions (e.g.
Business Objects, Cognos) or sub-sets of data management suites such as SAS or
SQL Server. These tools are frequently powerful and flexible, but are not targeted
to an industry, and to use them effectively requires a greater understanding
of statistics and data mining methods than our target market generally has
available. This paper introduces the PKB suite and its components in Section 2.
Section 3 discusses some of the important lessons learnt, while Section 4 presents
the current state of the project and the way forward.
The PKB suite is a core set of data mining tools that have been adapted to
the patient costing domain. The initial algorithm set (anomaly detection, trend
analysis and resource analysis) was derived through discussion with practition-
ers, focusing upon potential application and functional variation. Subsequently,
clustering and characterisation algorithms were appended to enhance usefulness.
Each algorithmic component has an interface wrapper, which is subsequently
incorporated within the PKB prototype. The interface wrapper provides effec-
tive textual and graphical elements, with respect to pre-processing, analysis and
presentation stages, that simplify both the use of PKB components and the
interpretation of their results. This is important as the intended users are hos-
pital administrators, not data mining practitioners and hence the tools must be
usable by educated users, without requiring sophisticated statistical knowledge.
Outlier (or anomaly) detection is a mature field of research with its origins
in statistics [4]. Current techniques typically incorporate an explicit distance
metric, which determines the degree to which an object is classified as an outlier.
A more contemporary approach incorporates an implied distance metric, which
alleviates the need for the pairwise comparison of objects [5, 6] by using domain
space quantisation to enable distance comparisons to be made at a higher level of
abstraction and, as a result, obviates the need to recall raw data for comparison.
The PKB outlier detection algorithm LION contributes to the state of the
art in outlier detection through novel quantisation and object allocation that
enable the discovery of outliers in large disk-resident datasets in two sequential
scans. Furthermore, LION addresses the need, realised during this project, for the
algorithm to discover not only outliers but also outlier clusters. By clustering
similar (close) outliers and presenting cluster characteristics it becomes easier
for users to understand the common traits of similar outliers, assisting the iden-
tification of outlier causality. An outlier analysis instance is presented in Figure
1, showing the interactive scatterplot matrix and cluster summarisation tables.
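LION itself is not specified in detail here, so the following Python fragment only sketches the generic cell-based idea the paragraph alludes to: quantise the domain into grid cells, count cell occupancy in one scan, and flag objects in sparsely populated cells in a second scan. The grid size, threshold and data are invented for illustration; this is not the LION algorithm.

import numpy as np
from collections import Counter

def grid_outliers(X, bins=10, min_cell_count=5):
    """Crude cell-based outlier detection: quantise each dimension into `bins`
    equal-width intervals, count objects per cell, and flag objects falling in
    cells with fewer than `min_cell_count` members. One pass builds the cell
    counts and a second pass labels every object."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum(((X - lo) / width * bins).astype(int), bins - 1)
    counts = Counter(map(tuple, cells))            # first scan: cell occupancy
    flags = np.array([counts[tuple(c)] < min_cell_count for c in cells])
    return flags                                   # second scan: label objects

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (500, 3)), rng.uniform(5, 8, (5, 3))])
print(grid_outliers(data).sum(), "objects flagged as outliers")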
Fig. 1. Outlier Analysis with Characterisation Tables and Selection
2.3 Characterisation
Characterisation (also a secondary component) was initially developed as a sub-
sidiary for outlier and clustering analysis in order to present descriptive sum-
maries of the clusters to the users. However, it is also present as an independent
tool within the suite. The characterisation algorithm provides this descriptive
cluster summary by finding the sets of commonly co-occurring attribute values
within the set of cluster objects. To achieve this, a partial inferencing engine,
similar to those used in association mining [7] is used. The engine uses the extent
of an attribute value's (element's) occurrence within the dataset to determine its
significance and subsequently its usefulness for summarisation purposes. Once
the valid elements have been identified, the algorithm deepens, finding progres-
sively larger, frequently co-occurring element sets from within the dataset.
Given the target of presenting summarised information about a cluster, the
valid elements are those that occur often within the source dataset. While this
works well for non-ordinal data, ordinal data requires partitioning into ranges,
allowing the significant mass to be achieved. This is accomplished by progres-
sively reducing the number of partitions, until at least one achieves a significant
volume. Given the range 1 to 100, an initial set of 2^6 partitions is formed; if
no partition is valid, each pair of partitions is merged by removing the least
significant bit (2^5 partitions). This process continues until a significant mass is
reached. This functionality is illustrated in Figure 1 through the presentation of
a summarisation table with ordinal-ranges.
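A minimal Python sketch of this partition-merging step is given below. It assumes equal-width partitions and a relative-frequency threshold for "significant mass"; both are assumptions for illustration, since the exact criteria of the PKB implementation are not given here.

import numpy as np

def significant_ranges(values, min_support=0.25, start_bits=6):
    """Partition an ordinal attribute into 2**start_bits equal-width ranges and
    repeatedly merge adjacent pairs (halving the partition count) until at least
    one range holds a significant share of the data. Returns the bin edges and
    the per-range relative frequencies at that level."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    for bits in range(start_bits, 0, -1):
        n_bins = 2 ** bits
        edges = np.linspace(lo, hi, n_bins + 1)
        counts, _ = np.histogram(values, bins=edges)
        freqs = counts / len(values)
        if freqs.max() >= min_support:     # a significant mass was reached
            return edges, freqs
    return np.array([lo, hi]), np.array([1.0])

rng = np.random.default_rng(42)
lengths_of_stay = rng.gamma(2.0, 3.0, size=1000)   # hypothetical ordinal attribute
edges, freqs = significant_ranges(lengths_of_stay)
print(len(freqs), "ranges, largest share:", round(freqs.max(), 2))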
Fig. 2. Resource Analysis
scaling and linear trend removal. The component then provides a rich higher-
level functional set incorporating temporal variation, the identification of offset
and subset trends, and the identification of dissimilar as well as similar trends,
as illustrated in Figure 3. The component provides a comprehensive interface
including ordered graphical pair-wise representations of results for comparison
purposes. For example, the evidence of an offset trend between two wards with
respect to admissions can indicate a causality that requires further investigation.
3 Lessons Learned
The PKB project began as a general investigation into the application of data
mining techniques to patient costing software, between Flinders University and
PowerHealth Solutions, providing both an academic and industry perspective.
Now, 18 months on from its inception, many lessons have been learnt that will
hopefully aid both parties in future interaction with each other and with other
partners. From an academic viewpoint, issues relating to the establishment of
beta test sites and the bullet-proofing of code are unusual, while from the industry
side the meandering nature of research and the potential for somewhat tangential
results can be frustrating. Overall, three main lessons have been learnt.
Solution looking for a problem? It is clear that understanding data and de-
riving usable information and insights from it is a problem in hospitals, but
how best to use the research and tools is not always clear. In particular, the
initial project specification was unclear as to how this would be achieved.
Fig. 3. Trend Analysis Presentation: parameter setting
4 Current State and Further Work
The second version of the PKB suite is now at beta-test stage, with validation
and further functional refinement required from industry partners. The suite
currently consists of a set of fast algorithms with relevant interfaces that do
not require special knowledge to use. Of importance in this stage is feedback
regarding the collection and pre-processing stages of analysis and how the suite
can be further refined to facilitate practitioners in undertaking this.
The economic benefits of the suite are yet to be quantified. Expected areas of
benefit are in the domain of quality of care and resource management. Focusing
upon critical indicators, such as death rates and morbidity codes, in combination
with multiple other dimensions (e.g. location, carer, casemix and demographic
dimensions) has the potential to identify unrealised quality issues.
Three immediate areas of further work are evident: the inclusion of external
repositories, knowledge base construction and textual data mining. The in-
corporation of external repositories, such as meteorological and socio-economic
data, within some analysis routines can provide useful information regarding causality,
while the incorporation of an evolving knowledge base will facilitate analysis by
either eliminating known information from result sets or flagging critical
artefacts. As most hospital data is not structured, but contained in notes, de-
scriptions and narrative, the mining of textual information will also be valuable.
References
1. Wolff, A.M., Bourke, J., Campbell, I.A., Leembruggen, D.W.: Detecting and re-
ducing hospital adverse events: outcomes of the Wimmera clinical risk management
program. Medical Journal of Australia 174 (2001) 621-625
2. Ehsani, J.P., Jackson, T., Duckett, S.J.: The incidence and cost of adverse events
in Victorian hospitals 2003-04. Medical Journal of Australia 184 (2006) 551-555
3. Kennedy, I.: Learning from Bristol: The report of the public inquiry into children's
heart surgery at the Bristol Royal Infirmary 1984-1995. Final report, COI Commu-
nications (2001)
4. Markou, M., Singh, S.: Novelty detection: a review - part 1: statistical approaches.
Signal Processing 83 (2003) 2481-2497
5. Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large
datasets. In Gupta, A., Shmueli, O., Widom, J., eds.: 24th International Confer-
ence on Very Large Data Bases, VLDB98, New York, NY, USA, Morgan Kaufmann
(1998) 392-403
6. Papadimitriou, S., Kitagawa, H., Gibbons, P., Faloutsos, C.: Loci: Fast outlier
detection using the local correlation integral. In Dayal, U., Ramamritham, K.,
Vijayaraman, T., eds.: 19th International Conference on Data Engineering (ICDE),
Bangalore, India (2003) 315-326
7. Ceglar, A., Roddick, J.F.: Association mining. ACM Computing Surveys 38 (2006)
8. Brockwell, P.J., Davis, R.A.: Time Series: Theory and Methods. Springer Verlag,
New York (1987)
9. Keogh, E.K., Chakrabarti, K., Mehrotra, S., Pazzani, M.: Locally adaptive dimen-
sionality reduction for indexing large time series databases. In: ACM SIGMOD In-
ternational Conference on Management of Data, Santa Barbara, CA, ACM (2001)
151-162
Combining text mining strategies to classify requests
about involuntary childlessness to an internet medical
expert forum
1 Introduction
Both healthy and sick people increasingly use electronic media to get medical
information and advice. Internet users actively exchange information with others
about subjects of interest or send requests to web-based expert forums (ask-the-
doctor services) [1-3]. An automatic classification of these requests could be helpful
for several reasons: (1) a request can be forwarded to the respective expert, (2) an
automatic answer can be prepared by localising a new request in a cluster of similar
requests which have already been answered, and (3) changes in information needs of
the population at risk can be detected in due time.
Text mining has been successfully applied, for example, in ascertaining and
classifying consumer complaints or in handling changes of address in emails sent to
companies. The classification of requests to medical expert forums is more difficult
because these requests can be very long and unstructured, mixing, for example,
personal experiences with laboratory data. Moreover, requests can be classified (1)
according to their subject matter or (2) with regard to the sender's expectation. While
the first aspect is of high importance for the medical experts to understand the
contents of requests, the latter is of interest for public health experts to analyse
information needs within the population. Therefore, the pressure to extract as much
useful information as possible from health requests and to classify them appropriately
is very strong.
To make full use of text mining in the case of complex data, different strategies
and a combination of those strategies may refine automatic classification and
clustering [4]. The aim of this paper is to present a highly flexible, author-controlled
method for an automatic classification of requests to a medical expert forum and to
assess its performance quality. This method was applied to a sample of requests
collected from the section "Wish for a Child" on the German website www.rund-ums-
baby.de.
2 Methods
To assess the most appropriate model for a classification, we used the following
selection methods: (1) Akaike Information Criterion, (2) Schwarz Bayesian Criterion,
(3) cross validation misclassification of the training data (leave one out), (4) cross
validation error of the training data (leave one out) and (5) variable significance based
on an individually adjusted variable significance level for the number of positive
cases.
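For illustration, the Python sketch below scores a few candidate logistic-regression models with AIC, SBC (BIC) and leave-one-out cross-validation on synthetic data. The original study used SAS models and a much larger space of variable combinations, so this only shows how the listed criteria are used to compare models, not the reported setup.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import LeaveOneOut, cross_val_score

def compare_models(candidate_feature_sets, X, y):
    """Score one binary classifier per candidate feature set with AIC, SBC (BIC)
    and leave-one-out misclassification, i.e. the families of selection criteria
    listed above. X is an (n, d) array, y a binary vector; feature sets are
    lists of column indices. Purely illustrative."""
    n = len(y)
    results = []
    for cols in candidate_feature_sets:
        Xs = X[:, cols]
        model = LogisticRegression(max_iter=1000).fit(Xs, y)
        ll = -log_loss(y, model.predict_proba(Xs), normalize=False)
        k = Xs.shape[1] + 1                       # coefficients + intercept
        aic = 2 * k - 2 * ll
        bic = k * np.log(n) - 2 * ll
        loo_err = 1 - cross_val_score(LogisticRegression(max_iter=1000),
                                      Xs, y, cv=LeaveOneOut()).mean()
        results.append((cols, aic, bic, loo_err))
    return results

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)
for cols, aic, bic, err in compare_models([[0], [0, 1], [0, 1, 2]], X, y):
    print(cols, f"AIC={aic:.1f} SBC={bic:.1f} LOO error={err:.3f}")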
By means of logical combinations of input variables with the different selection
criteria 1,761 models were trained. The statistical quality of the subject classification
was determined, among others, by recall and precision (see the table below).
Class                       N     Selection Criterion*   P (%)   Input Variables**   Precision   Recall
Subject Matters (examples)***
Abortion 40 AVSL 40 pc 83 100
Tubal examination 19 XMISC 1 k 100 100
Hormones 36 AVSL 30 svd 67 89
Expenses 25 XERROR 1 k 100 100
Worries during pregnancy 49 XMISC 40 pc 87 100
Pregnancy symptoms 36 XERROR 40 k 90 100
Semen analysis 57 XMISC 20 k 87 87
Hormonal stimulation 40 AVSL 40 pc 77 100
Cycle 79 AIC 30 k 88 75
Expectations (complete)
General information 533 AIC 40 k 83 89
Current treatment 331 AVSL 40 k 77 84
Interpretation of results 310 AVSL 40 k 79 86
Emotional reassurance 90 SBC 1 k 68 65
Interpretation 242 XERROR 30 k 60 82
Treatment opportunities 351 SBC 30 k 77 82
* AIC = Akaike Information Criterion, SBC = Schwarz Bayesian Criterion, XMISC = cross validation
misclassification, XERROR = cross validation error, AVSL = adjusted variable significance level
** k = Cramér's V variables, pc = principal components, svd = singular value dimensions
*** Out of the 988 requests sent to the expert forum, some examples of how the requests were classified
are given. The complete list comprised 32 subject matters.
3 Results
A 100% precision and 100% recall could be realized in 10 out of 38 categories on the
validation sample; some examples are given in the Table. The lowest rate for
precision was 67%, for recall 75% in the subject classification. The last six categories
(Expectations), however, performed considerably below their subject matter peers.
The Cramér's V variables (k) and the principal components (pc) proved to be
superior to the 240 singular value dimensions (svd) for automatic classification. At
the workshop, we will also present how the combination of different predictive
models improved recall and precision for both types of classification problems.
Finally, we will sketch out the approach for automatic answering of health requests on
the basis of text mining.
4 Discussion
References
1. Eysenbach G., Diepgen T.L.: Patients looking for information on the Internet and seeking
teleadvice: motivation, expectations, and misconceptions as expressed in e-mails sent to
physicians. Arch. Dermatol. 135 (1999) 151-156
2. Himmel W., Meyer J., Kochen M.M., Michelmann H.W.: Information needs and visitors
experience of an Internet expert forum on infertility. J. Med. Internet Res. 7 (2005) e20
3. Umefjord, G., Hamberg, K., Malker H, Petersson, G.: The use of an Internet-based Ask the
Doctor Service involving family physicians: evaluation by a web survey. Fam. Pract. 23
(2006) 159-166
4. Jeske D., Liu R.: Mining Massive Text Data and Developing Tracking Statistics. In: Banks
D., House L., McMorris F.R., Arabie P., Gaul W. (eds.): Classification, Clustering and Data
Mining Applications. Springer-Verlag, Berlin (2004) 495-510
5. Reincke, U. Profiling and classification of scientific documents with SAS Text Miner.
Available at http://km.aifb.uni-karlsruhe.de/ws/LLWA/akkd/8.pdf
Mining in Health Data by GUHA method
Jan Rauch
1 Introduction
GUHA is an original Czech method of exploratory data analysis developed since
the 1960s [3]. Its principle is to offer all interesting facts that follow from the given data
for the given problem. It is realized by GUHA procedures. A GUHA procedure
is a computer program whose input consists of the analysed data and of
several parameters defining a very large set of potentially interesting patterns.
The output is a list of all prime patterns. A pattern is prime if it is
true in the analysed data and does not immediately follow from other, more
simple output patterns.
The most used GUHA procedure is ASSOC [3], which mines for
association rules describing various relations between two Boolean attributes, including
rules corresponding to statistical hypothesis tests. The classical association
rules with confidence and support [1] are also mined, although under the name
founded implication [4]. The procedure ASSOC has been implemented several times,
see e.g. [5]. Its latest and currently most used implementation is the procedure
4ft-Miner [6], which has some new features. The implementation of ASSOC
does not use the well-known apriori algorithm. It is based on the representation of the
analysed data by suitable strings of bits [6]. This approach proved to be very
efficient for computing various contingency tables [6, 8]. It led to the implementation
of five additional GUHA procedures. All procedures are included in the system
LISp-Miner [7] (http://lispminer.vse.cz).
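The bit-string idea can be illustrated with a few lines of Python: each Boolean attribute is stored as a bit vector, conjunctions are bitwise ANDs, and the frequencies of a four-fold contingency table are popcounts. The data below are random placeholders and the code is only a sketch of the idea, not the LISp-Miner implementation.

import numpy as np

def four_fold_table(phi, psi):
    """Four-fold contingency table of two Boolean attributes stored as bit
    strings (here: numpy boolean arrays). Conjunctions are bitwise ANDs and
    the four frequencies are popcounts."""
    a = np.count_nonzero(phi & psi)       # phi and psi
    b = np.count_nonzero(phi & ~psi)      # phi and not psi
    c = np.count_nonzero(~phi & psi)
    d = np.count_nonzero(~phi & ~psi)
    return a, b, c, d

def derived(column, alpha):
    """Derived Boolean attribute A(alpha): the value of column A lies in alpha."""
    return np.isin(column, list(alpha))

rng = np.random.default_rng(3)
height = rng.integers(150, 200, size=1417)
coffee = rng.integers(0, 3, size=1417)             # 0 = does not drink coffee
phi = derived(height, range(166, 176)) & (coffee == 0)
psi = rng.random(1417) < 0.17                      # placeholder for Triglicerides(>= 96)
print(four_fold_table(phi, psi))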
These GUHA procedures have been applied many times to various health data. The
goal of this paper is to present experience with these applications. We use the data set
STULONG1 (http://euromise.vse.cz/challenge2004/data/index.html) to
demonstrate some important features of two GUHA procedures; see the next section.
* The work described here has been supported by the grant 1M06014 of the Ministry
of Education, Youth and Sports of the Czech Republic and by the grant 25/05 of
the University of Economics, Prague.
1 The study (STULONG) was realized at the 2nd Department of Medicine, 1st Faculty
of Medicine of Charles University and Charles University Hospital, Prague 2 (head
Prof. M. Aschermann, MD, ScD, FESC), under the supervision of Prof. F. Boudík,
MD, ScD, with the collaboration of M. Tomečková, MD, PhD and Assoc. Prof. J. Bultas,
MD, PhD. The data were transferred to electronic form by the EuroMISE Centre
of Charles University and the Academy of Sciences (head Prof. RNDr. J. Zvárová, DrSc).
2 Applying 4ft-Miner and SD4ft-Miner
There are six GUHA procedures in the LISp-Miner system that mine various
patterns verified using one or two contingency tables, see http://lispminer.
vse.cz/procedures/. The contingency tables make it possible to formulate pat-
terns that are easily understandable for users who are not specialists in data mining.
The bit-string approach used in the implementation also makes it easy
to deal with (automatically) derived Boolean attributes of the form A(α). Here α
is a subset of the set {a1, . . . , ak} of all possible values of attribute A. The Boolean
attribute A(α) is true in a row o of the analysed data matrix M if a ∈ α, where
a is the value of attribute A in the row o.
We use procedures 4ft-Miner and SD4ft-Miner and data matrix Entry of the
STULONG data set to demonstrate some of the possibilities of GUHA procedures.
Data matrix Entry concerns 1417 patients who are described by 64 attributes,
see http://euromise.vse.cz/challenge2004/data/entry/.
The procedure 4ft-Miner mines for association rules φ ≈ ψ, where φ and ψ
are Boolean attributes. The rule φ ≈ ψ means that φ and ψ are associated
in a way given by the symbol ≈, which is called a 4ft-quantifier. The 4ft-quantifier ≈
corresponds to a condition concerning the four-fold contingency table of φ and
ψ. The Boolean attributes φ and ψ are automatically derived from columns of the
analyzed data matrix. The association rule φ ≈ ψ is true in the data matrix M if
the condition corresponding to ≈ is satisfied in the contingency table of φ and
ψ in M. There are 14 types of 4ft-quantifiers implemented in 4ft-Miner, including
quantifiers corresponding to statistical hypothesis tests [6]. An example of an
association rule produced by 4ft-Miner is the rule

   Height⟨166, 175⟩ ∧ Coffee(not) ⇒+_{0.71, 51} Triglicerides(≥ 96).
The corresponding contingency table is shown in Fig. 1 (patients with missing values
are automatically omitted). It says that there are 51 patients satisfying both φ
and ψ, 122 patients satisfying φ and not satisfying ψ, etc. The presented rule
means that the relative frequency of patients with triglycerides ≥ 96 mg% among
the patients 166-175 cm tall who do not drink coffee (i.e. 51/173 = 0.29) is 71 per cent
greater than the average relative frequency of patients with triglycerides ≥ 96 mg%
in the whole data matrix (i.e. 195/1129 = 0.17). The support of the rule is 51.

   M         ψ      ¬ψ
   φ        51     122      173
   ¬φ      144     812      956
           195     934     1129

Fig. 1. Contingency table of Height⟨166, 175⟩ ∧ Coffee(not) and Triglicerides(≥ 96)
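A hedged Python sketch of how such an "above average" rule can be checked from a four-fold table follows; the quantifier condition is paraphrased from the explanation above, and the exact LISp-Miner definition may differ in details such as rounding.

def above_average(a, b, c, d, p=0.2, base=51):
    """Check an 'above average' 4ft-quantifier: among the objects satisfying phi,
    the relative frequency of psi must be at least (1 + p) times its relative
    frequency in the whole data matrix, and at least `base` objects must satisfy
    both phi and psi. a, b, c, d are the four frequencies of the four-fold table."""
    n = a + b + c + d
    conf_phi = a / (a + b)        # relative frequency of psi among phi-objects
    conf_all = (a + c) / n        # relative frequency of psi overall
    return conf_phi >= (1 + p) * conf_all and a >= base

a, b, c, d = 51, 122, 144, 812    # the four-fold table from Fig. 1
lift = (a / (a + b)) / ((a + c) / (a + b + c + d))
print(round(lift, 2), above_average(a, b, c, d))   # 1.71 True (the rule's 71 per cent)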
The presented rule is one of the results of a run of 4ft-Miner related to the
analytical question "What combinations of patients' characteristics lead to at least
20 per cent greater relative frequency of extreme values of triglycerides?". We used
parameters defining derived Boolean attributes like Height⟨?, ? + 10⟩ (i.e. a sliding
window of length 10) or Triglicerides⟨min, ?⟩ and Triglicerides⟨?, max⟩ (i.e.
extreme values of Triglicerides). There were 281 rules found among 918 300 rules
tested in 59 seconds on a PC with 1.58 GHz. For more details concerning 4ft-Miner
see e.g. [6].
The procedure SD4ft-Miner mines for SD4ft-patterns, i.e. expressions of the form
α ⋈ β : φ ≈ ψ / γ. An example is the pattern

   normal ⋈ risk : Skinfold Triceps(5, 15⟩ ∧ Educ(Univ) ⇒_{0.4} Diast(60, 90⟩ / Beer(≤ 1).

This SD4ft-pattern is verified using two contingency tables, see Fig. 2. The
table TN concerns normal patients drinking at most one liter of beer per day; the
second table TR concerns risk patients drinking at most one liter of beer per day.

   TN       ψ     ¬ψ            TR       ψ      ¬ψ
   φ       40      4     44     φ       47      46      93
   ¬φ      89     22    111     ¬φ     231     155     386
          129     26    155            278     201     479

Informally speaking, the pattern means that the set of normal patients differs
from the set of risk patients with respect to the confidence of the association rule
Skinfold Triceps(5, 15⟩ ∧ Educ(Univ) → Diast(60, 90⟩ (abbreviated φ → ψ)
if we consider patients drinking at most one liter of beer per day. The difference of
confidences is 0.4: the confidence of φ → ψ on TN is 40/44 = 0.91, the confidence
of φ → ψ on TR is 47/93 = 0.51. Generation and verification of 31.6 × 10^6
SD4ft-patterns took 23 minutes on the PC with 1.58 GHz, and 13 patterns with a
difference of confidences ≥ 0.4 were found, including the presented one.
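Once the two four-fold tables are known, the corresponding check is only a few lines; the sketch below merely compares the two confidences, which is the final verification step of an SD4ft-pattern, not the SD4ft-Miner search procedure itself.

def confidence(a, b):
    """Confidence of phi => psi from the first row (a, b) of a four-fold table."""
    return a / (a + b)

# first rows of the two four-fold tables: TN (normal) and TR (risk) patients,
# both restricted to patients drinking at most one liter of beer per day
conf_normal = confidence(40, 4)            # 40/44 = 0.91
conf_risk = confidence(47, 46)             # 47/93 = 0.51
print(round(conf_normal - conf_risk, 2))   # 0.4, so the SD4ft-pattern holds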
The additional GUHA procedures implemented in the LISp-Miner system
are KL-Miner, CF-Miner, SDKL-Miner, and SDCF-Miner [7, 8].
3 Conclusions
Tens of runs of the GUHA procedures of the LISp-Miner system were done on various
health data. Some of them were very successful, e.g. the analysis of data describing
thousands of catheterisations in the General Faculty Hospital in Prague [9]. We can
summarize:
The principle of the GUHA method - to offer all true patterns relevant to the
given analytical question - proved to be very useful. All the implemented
procedures are able to generate and verify a huge number of relevant patterns
in reasonable time. The input parameters of the GUHA procedures make it pos-
sible both to fit the generated patterns to the data mining problem being solved and
to tune the number of output patterns in a reasonable way.
Simple conditions concerning frequencies from contingency tables are used,
which makes the results easily understandable even for non-specialists. However,
there are also patterns corresponding to statistical hypothesis tests, see e.g.
[3, 6-8], that are intended for specialists.
In particular, the procedure 4ft-Miner gives fast orientation in large data, and it is
useful to combine it with additional analytical software.
It is important that the complexity of the algorithms used is linearly dependent
on the number of rows of the analysed data matrix [6, 8].
However, there are a lot of related open problems. The first one is how to
efficiently use the large possibilities of fine-tuning the definition of the set of
relevant patterns to be generated and verified. Another problem is how to
combine particular procedures when solving a given complex data mining task.
The next step of a solution chain depends both on the goal of the analysis and on the
results of previous runs of procedures. A big challenge is the automated chaining of
particular procedures of LISp-Miner to solve a given problem.
Thus one of our next research goals is to build an open system called Ever-
Miner of tools to facilitate solving real problems using all procedures imple-
mented in the LISp-Miner system. The EverMiner system will consist of typical
tasks, scenarios, repositories of definitions of sets of relevant patterns, etc. [7].
References
1. Agrawal, R. et al.: Fast Discovery of Association Rules. In Fayyad, U. M. et al.:
Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996, pp. 307-328
2. Burian, J., Rauch, J.: Analysis of Death Causes in the STULONG Data Set. In:
Berka, Petr (ed.). Discovery Challenge. Zagreb: IRB, 2003, pp. 47-58
3. Hájek, P., Havránek, T.: Mechanizing Hypothesis Formation (Mathematical Foun-
dations for a General Theory), Springer-Verlag, 1978.
4. P. Hájek (guest editor), International Journal of Man-Machine Studies, special issue
on GUHA, vol. 10, January 1978
5. P. Hájek, A. Sochorová and J. Zvárová: GUHA for personal computers. Com-
putational Statistics & Data Analysis, vol. 19, pp. 149-153, February 1995
6. Rauch, J., Šimůnek, M. (2005) An Alternative Approach to Mining Association Rules.
In: Lin, T. Y., Ohsuga, S., Liau, C. J., and Tsumoto, S. (eds.) Data Mining: Foundations,
Methods, and Applications, Springer-Verlag, 2005, pp. 219-238
7. Rauch, J., Šimůnek, M.: GUHA Method and Granular Computing. In: Hu, X. et al.
(eds.) Proceedings of the IEEE Conference on Granular Computing, 2005, pp. 630-635.
8. Rauch, J., Šimůnek, Milan, Lín, Václav (2005) Mining for Patterns Based on Contin-
gency Tables by KL-Miner - First Experience. In: Lin, T. Y. et al. (eds.) Foundations
and Novel Approaches in Data Mining. Berlin, Springer-Verlag, pp. 155-167.
9. Štochl, J., Rauch, J., Mrázek, V.: Data Mining in Medical Databases. In: Zvárová, J.
et al. (eds.) Proceedings of the International Joint Meeting EuroMISE 2004. Praha:
EuroMISE, 2004, p. 36
An Investigation into a Beta-Carotene/Retinol Dataset
Using Rough Sets
1 Introduction
Clinical studies have suggested that low dietary intake or low plasma concen-
trations of retinol, beta-carotene, or other carotenoids might be associated with in-
creased risk of developing certain types of cancer. Anti-oxidants have long been
known to reduce the risks of a variety of cancers, and substantial efforts have been
made in the research and clinical communities to find ways of elevating anti-oxidants
to levels that are therapeutic [1]. In addition to well-known functions such as dark
adaptation and growth, retinoids have an important role in the regulation of cell differ-
entiation and tissue morphogenesis. Following numerous experimental studies on the
effects of retinoids on carcinogenesis, clinical use of retinoids has already been intro-
duced in the treatment of cancer (acute promyelocytic leukemia) as well as in the
chemoprevention of carcinogenesis of the head and neck region, breast, liver and
uterine cervix [2]. Given the importance of this class of chemicals in their role as
anti-carcinogens, we investigated a small dataset containing 315 patient records re-
corded over a 1-year period. The purpose of this study was to determine which of the
collected attributes had a positive correlation with plasma levels of these anti-
oxidants.
This dataset contains 315 observations on 14 variables. Two of the variables (attrib-
utes) are the plasma levels of beta-carotene and retinol. The dataset was split
into two: one containing the beta-carotene levels and all other attributes (except the
retinol levels), resulting in 12 attributes and one decision class. The same technique
was applied, leaving out the beta-carotene levels (retaining the retinol levels as the
decision class). There were no missing values and the attributes consisted of both
categorical and continuous data. Since the decision attribute was continuous, it was
discretised such that a binary decision was obtained. Briefly, the distribution of the
decision class was generated and the median value was used as a cut point for the two
classes. In order to determine the effectiveness of this cut-off point, several trials with
a range of +/-2% of the median (up to 100%) were tried in an exhaustive manner to
find (empirically) the best threshold value. The best threshold value was determined
by the resulting classification accuracy. The attributes were discretised whenever
possible in order to reduce the number of rules that would be generated using rough
sets. Once the dataset had been transformed into a decision table, we applied the
rough set algorithm, described formally in the next section.
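The threshold search around the median can be sketched as follows. A decision tree and synthetic data stand in for the rough-set classifier and the real patient records, and the search range is truncated to +/-10% for brevity, so this is only an illustration of the procedure, not a reproduction of the study.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def best_binary_cut(X, continuous_target, rel_steps=np.arange(-0.10, 0.102, 0.02)):
    """Search thresholds around the median of a continuous decision attribute
    (in +/-2% steps, as in the study) and keep the cut that gives the best
    cross-validated classification accuracy. The classifier here is a stand-in
    for the rough-set rules."""
    median = np.median(continuous_target)
    best = None
    for step in rel_steps:
        threshold = median * (1 + step)
        y = (continuous_target > threshold).astype(int)
        if y.min() == y.max():            # degenerate split, skip
            continue
        acc = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5).mean()
        if best is None or acc > best[0]:
            best = (acc, threshold)
    return best

rng = np.random.default_rng(7)
X = rng.normal(size=(315, 12))                               # 12 predictor attributes
plasma = np.exp(X[:, 0] + rng.normal(scale=0.8, size=315))   # hypothetical plasma level
print(best_binary_cut(X, plasma))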
Rough set theory is a relatively new data-mining technique used in the discovery of
patterns within data, first formally introduced by Pawlak in 1982 [3,4]. Since its in-
ception, the rough sets approach has been successfully applied to deal with vague or
imprecise concepts, extract knowledge from data, and reason about knowledge
derived from the data [5,6]. We demonstrate that rough sets have the capacity to evalu-
ate the importance (information content) of attributes, discover patterns within data,
eliminate redundant attributes, and yield the minimum subset of attributes for the
purpose of knowledge extraction.
3 Methods
The structure of the dataset consisted of 14 attributes, including the decision attribute
(labelled 'result'), which was displayed for convenience in Table 1 above. There were
4,410 entries in the table with 0 missing values. The attributes contained a mixture of
categorical (e.g. sex) and continuous (e.g. age) values, both of which can be used by
rough sets without difficulty. The principal issue with rough sets is to discretise the
attribute values; otherwise an inordinately large number of rules is generated. We
employed an entropy-preserving minimal description length (MDL) algorithm to dis-
cretise the data into ranges. This resulted in a compact description of the attribute
values which preserved information while keeping the number of rules to a reasonable
number (see the results section for details). We determined Pearson's correlation
coefficient of each attribute with respect to the decision class. Next, we partitioned
the dataset into training/test cases. We selected a 75/25% training/testing scheme
(236/79 cases respectively) and repeated this process with replacement 50 times. The
results reported in this paper are the average of these 50 trials. We used dynamic
reducts, as experience with other rough sets based reduct generating algorithms (cf.
Rosetta) has indicated this provides the most accurate result [9]. Lastly, we created
the rules that were to be used for the classification purpose. The results of this process
are presented in the next section.
4 Results
After separating the beta-carotene decision from the retinol decision attribute, the
rough set algorithm was applied as described above, using an implementation available
from the Internet (see reference [2]). In Table 3, displayed below, samples of the
resulting confusion matrices are displayed. The confusion matrix provides data on the
reliability of the results, indicating true positives/negatives and false posi-
tives/negatives. From these values, one can compute the accuracy, the positive predictive
value, and related measures.

                 Predicted Low   Predicted High
   Actual Low          32               6          0.84
   Actual High          3              38          0.93
                      0.91            0.86         0.89
Table 4. A sample of the rules produced by the rough sets classifier. The rules com-
bine attributes in conjunctive normal form and map each to a specific decision class.
The * corresponds to an end point in the discretised range: the lowest value if it
appears on the left-hand side of a sub-range, or the maximum value if it appears on the
right-hand side of a sub-range.
Table 4 shows a sample of the resulting rules that were generated during the classification
process. The rules generated are in an easy-to-read form: if attribute x = A then consequent = B.
5 Discussion
References
1. Nierenberg DW, Stukel TA, Baron JA, Dain BJ, Greenberg ER. Determinants of plasma
levels of beta-carotene and retinol. American Journal of Epidemiology 1989;130:511-521
2. Krinsky, NI & Johnson, EJ, Department of Biochemistry, School of Medicine, Tufts Uni-
versity, 136 Harrison Avenue, Boston, MA 02111-1837, USA; Jean Mayer USDA Human
Nutrition Research Center on Aging at Tufts University, 136 Harrison Avenue, 711 Wash-
ington St, Boston, MA 02111-1837,USA.
3. Z. Pawlak . Rough Sets, International Journal of Computer and Information Sciences, 11,
pp. 341-356, 1982.
4. Pawlak, Z.: Rough sets Theoretical aspects of reasoning about data. Kluwer (1991).
5. J. Wroblewski.: Theoretical Foundations of Order-Based Genetic Algorithms. Funda-
menta Informaticae 28(3-4) pp. 423430, 1996.
6. D. Slezak.: Approximate Entropy Reducts. Fundamenta Informaticae, 2002.
7. K. Revett and A. Khan, A Rough Sets Based Breast Cancer Decision Support System,
METMBS, Las Vegas, Nevada, June, 2005
8. Gorunescu, F., Gorunescu, F., El-Darzi, E., Gorunescu, S., & Revett, K.: A Cancer Diagno-
sis System Based on Rough Sets and Probabilistic Neural Networks, First European Con-
ference on Health Care Modelling and Computation, University of Medicine and Pharmacy
of Craiova, pp. 149-159.
9. Rosetta: http://www.idi.ntnu.no/~aleks/rosetta
Data Mining Applications for Quality Analysis in
Manufacturing
Roland Grund
Abstract
Besides traditional application areas like CRM, data mining also offers inter-
esting capabilities in manufacturing. In many cases large amounts of data are
generated during research or production processes. This data mostly contains
useful information about quality improvement. Very often the question is how
a set of technical parameters influences a specific quality measure - a typical
classification problem. Knowing these interactions can be of great value because
even small improvements in production can save a lot of costs.
As in other areas, in the manufacturing industry there is a trend towards sim-
plification and integration of data mining technology. In the past, analysis was
usually done on smaller isolated data sets with a lot of handwork. Meanwhile,
more and more data warehouses have become available containing well-organized
and up-to-date information. Modern software technology makes it possible to build cus-
tomized analytical applications on top of these warehouses. Data mining is inte-
grated, simplified, automated, and it can be combined with standard reporting,
so that even an end-user can make use of it.
implementation of such applications still includes a variety of challenges.
The talk shows examples from the automotive industry (DaimlerChrysler and
BMW) and provides a brief overview of IBM's recent data mining technology.
Towards Better Understanding of Circulating Fluidized
Bed Boilers: A Data Mining Approach
1 Department of MIT, University of Jyväskylä, P.O. Box 35, 40351 Jyväskylä, Finland
[email protected], [email protected], [email protected]
2 VTT Processes, P.O. Box 1603, 40101 Jyväskylä, Finland
{Antti.Tourunen, Heidi.Nevalainen}@vtt.fi
1 Introduction
in the bed, which may cause changes in the burning rate, oxygen level and increase
CO emissions. This is especially important when considering the new biomass-based
fuels, which have increasingly been used to replace coal. These new biofuels are often
rather inhomogeneous, which can cause instabilities in the feeding. These fuels are
usually also very reactive. Biomass fuels have much higher reactivity compared to
coals and the knowledge of the factors affecting the combustion dynamics is
important for optimum control. The knowledge of the dynamics of combustion is also
important for optimizing load changes [2].
The development of a set of combined software tools intended for carrying out
signal processing tasks with various types of signals has been undertaken at the
University of Jyväskylä in cooperation with VTT Processes1. This paper presents the
vision of further collaboration aimed to facilitate intelligent analysis of time series
data from circulating fluidized bed (CFB) sensors measurements, which would lead to
better understanding of underlying processes in the CFB reactor.
Currently there are three main topics in CFB combustion technology development:
once-through steam cycle, scale-up (600-800 MWe), and oxyfuel combustion.
Supercritical CFB combustion utilizes coal, biofuels, and multifuels more cleanly,
efficiently, and sustainably, but needs advanced automation and
control systems because of its physical peculiarities (relatively small steam volume
and absence of a steam drum). Also the fact that fuel, air, and water mass flows are
directly proportional to the power output of the boiler sets tight demands for the
control system, especially in CFB operation where a huge amount of solid material exists
in the furnace.
As CFB boilers become larger, not only the mechanical designs but
also the understanding of the process and of the process conditions affecting heat
transfer, flow dynamics, carbon burnout, hydraulic flows, etc. have become important
factors. Regarding furnace performance, the larger size increases the horizontal
dimensions of the CFB furnace, causing concerns about ineffective mixing of combustion
air, fuel, and sorbent. Consequently, new approaches and tools are needed for
developing and optimizing the CFB technology with respect to emissions, the combustion
process, and furnace scale-up [3].
The fluidization phenomenon is at the heart of CFB combustion, and for that reason
pressure fluctuations in fluidized beds have been widely studied during the last decades.
Other measurements have not been studied so widely. Given the challenging
objectives laid down for CFB boiler development, it is important to extract as
much information as possible on prevailing process conditions in order to optimize
boiler performance. Instead of individual measurements, the combination of
information from different measurements and their interactions will provide a
possibility to deepen the understanding of the process.
1 Further information about the project, its progress and deliverables will be made available
from http://www.cs.jyu.fi/~mpechen/CFB_DM/index.html.
Fig. 1. A simplified view of a CFB boiler operation with the data mining approach
A very simplified view on how a CFB boiler operates is presented in the upper part
of Fig. 1. Fuel (mixture of fuels), air, and limestone are the controlled inputs to the
furnace. Fuel is utilized for heat production; air is added to enhance the combustion
process, and limestone is aimed at reducing sulfur dioxide (SO2) emissions. The produced
heat converts water into steam that can be utilized for different purposes. The
measurements from sensors SF, SA, SL, SH, SS and SE that correspond to different input
and output parameters are collected in a database repository together with other meta-
data describing process conditions for both offline and online analysis. Conducting
experiments with the pilot CFB reactor and collecting their results into a database creates
the necessary prerequisites for utilizing the vast number of DM techniques aimed
at identifying valid, novel, potentially useful, and ultimately understandable patterns
in data that can be further utilized to facilitate process monitoring, process
understanding, and process control.
The estimation of a boiler's efficiency is not straightforward. The major estimates
that can be taken into account include the ratio of produced volumes of steam to the
volumes of the consumed fuels, correspondence of volumes of emissions to
environmental laws, amortization/damage of parts of the boiler's equipment, reaction to
fluctuations in power demand, costs and availability of fuels, and others. Correspond-
ing to these factors, a number of efficiency optimization problems can be defined.
However, developing a common understanding of what the basic, enabling, and
strategic needs are and how they can be addressed with DM technology is
essential. From the DM perspective, having input and output
measurements of the processes, we are first of all interested in: (1) finding patterns of their
relation to each other, including estimation of process delays, level of gain, and
dynamics; (2) building predictive models of emissions and steam volumes; (3) building
predictive models of char load, having few measurements of it under different
conditions during the experiments (in a commercial plant there is no way to measure
char inventory on-line as can be done with oxygen concentration and other output
parameters). Currently, we are concentrating on estimating the responses of the
burning rate and fuel inventory to changes in fuel feeding. Different changes in the
fuel feed, such as an impulse, step change, linear increase, and cyclic variation, have
been experimented with on the pilot CFB reactor. In [1] we focused on the particular task of
estimating similarities between data streams from the pilot CFB reactor, assessing the
appropriateness of the most popular time-warping techniques for this particular domain.
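As a concrete example of such a time-warping technique, the following textbook dynamic time warping (DTW) sketch compares two hypothetical sensor responses; it is not the project's implementation, and the signals are invented for illustration.

import numpy as np

def dtw_distance(x, y):
    """Classic O(len(x)*len(y)) dynamic time warping distance between two
    1-D signals, the kind of time-warping similarity measure evaluated in [1]."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# two hypothetical sensor responses to a step change in fuel feed, one lagging the other
t = np.linspace(0, 10, 200)
burning_rate = 1 / (1 + np.exp(-(t - 4)))
oxygen_level = 1 / (1 + np.exp(-(t - 5.5)))
print(round(dtw_distance(burning_rate, oxygen_level), 3))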
We recognized basic, enabling, and strategic needs, defined what the current needs
are and focused on them, while continuing to define the most important directions of our
further joint research efforts. Those include the development of techniques and software
tools which would help to monitor, control, and better understand the individual
processes (utilizing domain knowledge about physical dependences between processes
and meta-information about changing conditions when searching for patterns, and
extracting features at individual and structural levels). A further step is learning to
predict char load from a few supervised examples, which will likely lead to the adoption
of the active learning paradigm in the time series context. Progressing in these directions,
we will finally be able to address the strategic needs related to the construction of an
intelligent CFB boiler, which would be able to optimize the overall efficiency of the
combustion processes.
Thus, a DM approach applied to the time series data accumulated from
CFB boilers can make critical advances in addressing a varied set of the stated problems.
Acknowledgments. The work is carried out with a financial grant from the Research
Fund for Coal and Steel of the European Community (Contract No. RFC-CR-03001).
References
1. Pechenizkiy M. et al. Estimation of Similarities between the Signals from the Pilot CFB-
Reactor with Time Warping Techniques. Unpublished technical report (2006)
2. Saastamoinen, J., Modelling of Dynamics of Combustion of Biomass in Fluidized Beds,
Thermal Science 8(2), (2004) 107-126
3. Tourunen, A. et al. Study of Operation of a Pilot CFB-Reactor in Dynamic Conditions, In:
Proc. 17th Int. Conf. on Fluidized Bed Combustion, ASME, USA (2003) FBC 2003-073
Clustering of Psychological Personality Tests of
Criminal Offenders
1 Introduction
attempts have been made to integrate findings from available classification stud-
ies. These efforts have suggested some potential replications of certain offender
types but have been limited by their failure to provide clear classification rules.
For example, psychopathic offenders have emerged from a large clinical literature, but
there remains much dispute over how to identify them, what specific social
and psychological causal factors are critical, whether or not this type exists
among female offenders or among adolescents, and whether there are sub-types
of psychopaths. Thus, a current major challenge in criminology is to address
whether reliable patterns or types of criminal offenders can be identified using
data mining techniques and whether these may replicate substantive criminal
profiles as described in the prior criminological literature.
In a recent study, juvenile offenders (N = 1572) from three U.S. state systems
were assessed using a battery of criminogenic risk and needs factors as well as
official criminal histories. Data mining techniques were applied with the goal
of identifying intrinsic patterns in this data set and assessing whether these replicate
any of the main patterns previously proposed in the criminological literature [23].
The present study aimed to identify patterns from this data and to demonstrate
that they relate strongly to only certain of the theorized patterns from the prior
criminological literature. The implications of these findings for criminology are
manifold. The findings firstly suggest that certain offender patterns can be reliably
identified from data using a variety of unsupervised data mining clustering techniques.
Secondly, the findings strongly challenge those criminological theorists who hold
that there is only one general global explanation of criminality as opposed to
multiple pathways with different explanatory models (see [24]).
Thus, in this research we also demonstrate a methodology to identify well-
clustered cases. Specifically, we combine a semi-supervised technique with an
initial standard clustering solution. In this process we obtained highly replicated
offender types with clear definitions of each of the reliable and core criminal
patterns. These replicated clusters provide social and psychological profiles that
bear a strong resemblance to several of the criminal types previously proposed by
leading criminologists [23,20]. However, the present findings firstly go beyond
these prior typological proposals by grounding the type descriptions in clear
empirical patterns. Secondly, they provide explicit rules for offender
classification that have been absent from this prior literature.
2 Method
Bagging has been used with success for many classification and regression tasks
[9]. In the context of clustering, bagging generates multiple classification models
from bootstrap replicates of the selected training set and then integrates these
into one final aggregated model. By using only two-thirds of the training set to
create each model, we aimed to achieve models that should be fairly uncorrelated
so that the final aggregated model may be more robust to noise or any remaining
outliers inherent in the training set.
In [6] a method combining Bagging and K-means clustering is introduced. In
our analyses we used the K-means implementation in R [7]. We generated 1000
random bags from our initial sample of 1,572 cases with no outliers removed
to obtain cluster solutions for each bag. The centers of these bags were then
treated as data points and re-clustered with K-means. The final run of this K-
means was first seeded with the centers from our initial solution, which was then
tested against one obtained with randomly initialized centers. These resulted in
the same solution, suggesting that initializing the centers in these ways did not
unduly bias K-means convergence. The resulting stable labels were then used
as our final centers for the total dataset and in the voting procedure outlined
below.
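A compact Python sketch of this bagged K-means procedure follows, using scikit-learn in place of the R implementation and synthetic data shaped like the 32 COMPAS scale scores; the cluster count, bag size and all data values are illustrative assumptions rather than the study's actual configuration.

import numpy as np
from sklearn.cluster import KMeans

def bagged_kmeans(X, k=7, n_bags=1000, bag_frac=2/3, seed=0):
    """Bagged K-means in the spirit of the procedure described above: cluster
    many bags drawn with replacement, treat the resulting centers as data
    points, re-cluster them, and use the aggregated centers to seed a final
    K-means run over the full sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    bag_size = int(bag_frac * n)
    centers = []
    for _ in range(n_bags):
        idx = rng.choice(n, size=bag_size, replace=True)
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx])
        centers.append(km.cluster_centers_)
    centers = np.vstack(centers)                       # (n_bags * k, d)
    meta = KMeans(n_clusters=k, n_init=10, random_state=0).fit(centers)
    final = KMeans(n_clusters=k, init=meta.cluster_centers_, n_init=1).fit(X)
    return final.labels_, final.cluster_centers_

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 1.0, size=(225, 32)) for c in range(7)])  # 7 blobs, 32 scales
labels, centers = bagged_kmeans(X, k=7, n_bags=50)    # fewer bags for a quick demo
print(np.bincount(labels))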
2.2 Semi-Supervised Clustering
consensus model - is shown by the almost identical matching of these core cases
between the consensus model and the bagged K-means solution (agreement of .992 and
.994) and also with the original K-means (0.949 and 0.947).
3 Results
The core clusters obtained with this method were interpreted and relationships
with types already identified in the criminology literature examined.
The clusters identified were Internalizing Youth A[20,13,16], Socialized Delin-
quents [12,14,15], Versatile Offenders[20], Normal Accidental Delinquents[18],
Internalizing Youth B[20], Low-control Versatile Offenders[20,21] and Norma-
tive Delinquency [19]. All the clusters relate to types that have been previously
identified in various studies in the criminology literature, but were never identified
at the same time in one data set using clustering.
External validation requires finding significant differences between clusters
on external (but relevant) variables that were not used in cluster development.
By comparing the means and bootstrapped 95 percent confidence intervals of
four external variables across the seven clusters from the core consensus solu-
tion we identified those variables. The external variables include three criminal
history variables (total adjudications, age-at-first adjudication and total violent
felony adjudications) and one demographic variable (age-at-assessment). These
plots show a number of significant differences in expected directions. For exam-
ple, clusters 4 and 7, which both match the low-risk profile of Moffitt's AL type
[18], have a significantly later age-at-first adjudication compared to the higher-risk
cluster 6 that matches Moffitt's high-risk LCP and Lykken's [20] Secondary Psy-
chopath. This latter cluster has the earliest age-at-first arrest and significantly
higher total adjudications, which is consistent with Moffitt's descriptions.
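Such bootstrapped confidence intervals can be computed with a simple percentile bootstrap, sketched below in Python on invented values of one external variable in two clusters; the actual study design and numbers are not reproduced here.

import numpy as np

def bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of one external
    validation variable within one cluster."""
    rng = np.random.default_rng(seed)
    means = [rng.choice(values, size=len(values), replace=True).mean()
             for _ in range(n_boot)]
    lo, hi = np.percentile(means, [(1 - level) / 2 * 100, (1 + level) / 2 * 100])
    return values.mean(), lo, hi

# hypothetical age-at-first-adjudication values for two clusters
rng = np.random.default_rng(2)
cluster_4 = rng.normal(15.2, 1.5, size=180)
cluster_6 = rng.normal(13.1, 1.5, size=140)
for name, v in [("cluster 4", cluster_4), ("cluster 6", cluster_6)]:
    m, lo, hi = bootstrap_ci(v)
    print(f"{name}: mean {m:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")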
Finally, while our results indicate that boundary conditions of clusters are
obviously unreliable and fuzzy, the central tendencies or core membership appear
quite stable. This suggests that these high density regions contain sufficient
taxonomic structure to support reliable identification of type membership for a
substantial proportion of juvenile offenders.
Using the method in Section 2 we were able to remove most of the hybrid
cases. In fact, the case removal was overly aggressive and removed roughly half
the data set. However, the remaining cases were very interpretable on manual
inspection. Our analyses also show that cluster boundaries are relatively unstable.
Kappas from 0.55 to 0.70, although indicating general overlap, also imply
that boundaries between clusters may be imposed differently, and cases close to
boundaries may be unreliably classified across adjacent clusters. Many of these
cases may be regarded as hybrids with many co-occurring risks or needs and
multiple causal influences. Lykken [20] recognized this by stating that many
offenders "will have mixed etiologies and will be borderline or hybrid cases" (p. 21).
The presence of hybrids and outliers appears unavoidable given the multivariate
complexity of delinquent behavior, the probabilistic nature of most risk
factors and the multiplicity of causal factors. Additionally, our findings on boundary
conditions and non-classifiable cases must remain provisional, since refinements
to our measurement space may reduce boundary problems. Specifically, it is
known that the presence of noise and non-discriminating variables can blur
category boundaries [10]. Further research may clarify the discriminating power of
all classification variables (features) and gradually converge on a reduced space
of only the most powerful features.
4 Conclusion
In this paper we report on our experiences with finding clusters in the Youth
COMPAS data set, which contains 32 scale scores used for criminogenic
assessment.
Cluster analysis methods (Ward's method, standard K-means, bagged K-means
and a semi-supervised pattern learning technique) were applied to the data.
Cross-method verification and external validity were examined. Core or exemplar
cases were identified by means of a voting (consensus) procedure. Seven
recurrent clusters emerged across replications.
The clusters that were found using unsupervised learning techniques partially
replicate several criminal types that have been proposed in previous criminological
research. However, the present analyses provide more complete empirical
descriptions than most previous studies and allow operational identification of
type membership. Additionally, the presence of certain sub-types among these
major types is suggested by the present analysis. This is the first study in which
most of the well-replicated patterns were identified purely from the data. We
stress that many prior studies provided only partial theoretical or clinical
descriptions, omitted operational type-identification procedures or used very
limited feature sets.
We introduced a novel way of eliminating hybrid cases in an unsupervised
setting and, although we are still working on establishing a more solid theoretical
foundation for the technique, it has generally produced good and very
interpretable clusters. From the resulting clusters a classifier was built from the
data in order to classify new cases.
It is noteworthy that the initial solution we obtained with an elaborate outlier
removal process using Ward's linkage and regular K-means was easily replicated
using bagged K-means without outlier removal or other manual operations.
In this instance bagged K-means appears to be very robust against noise and
outliers.
References
1. Zhou, D., Bousquet, O., Lal, T., Weston, J., Schölkopf, B.: Learning with local
and global consistency. In Thrun, S., Saul, L.K., Schölkopf, B., eds.: Advances in
Neural Information Processing Systems 16, Cambridge, Mass., MIT Press (2004)
2. Dempster, A., Laird, N., Rubin, D.: Maximum likelihood from incomplete data via
the EM algorithm. Journal of the Royal Statistical Society 39 (1977)
3. Dimitriadou, E., Weingessel, A., Hornik, K.: Voting-merging: An ensemble method
for clustering. In: Lecture Notes in Computer Science. Volume 2130. Springer Verlag
(2001) 217
4. Topchy, A.P., Jain, A.K., Punch, W.F.: Combining multiple weak clusterings. In:
Proceedings of the ICDM (2003) 331–338
5. Lin, C.R., Chen, M.S.: Combining partitional and hierarchical algorithms for robust
and efficient data clustering with cohesion self-merging. IEEE Transactions on
Knowledge and Data Engineering 17 (2005) 145–159
6. Dolnicar, S., Leisch, F.: Getting more out of binary data: Segmenting markets by
bagged clustering. Working Paper 71, SFB Adaptive Information Systems and
Modeling in Economics and Management Science (2000)
7. R Development Core Team: R: A language and environment for statistical computing.
R Foundation for Statistical Computing, Vienna, Austria (2004). ISBN 3-900051-07-0
8. Beyer, K., Goldstein, J., Ramakrishnan, R., Shaft, U.: When is nearest neighbor
meaningful? Lecture Notes in Computer Science 1540 (1999) 217–235
9. Breiman, L.: Bagging predictors. Machine Learning 24(2) (1996) 123–140
10. Milligan, G.W.: Clustering validation: Results and implications for applied analyses.
In Arabie, P., Hubert, L., De Soete, G., eds.: Clustering and Classification, World
Scientific Press, River Edge, NJ (1996) 345–379
11. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann,
San Francisco (2000)
12. Miller, W.: Lower-class culture as a generating milieu of gang delinquency. Journal
of Social Issues 14 (1958) 5–19
13. Miller, M., Kaloupek, D.G., Dillon, A.L., Keane, T.M.: Externalizing and internalizing
subtypes of combat-related PTSD: A replication and extension using the PSY-5
scales. Journal of Abnormal Psychology 113(4) (2004) 636–645
14. Jesness, C.F.: The Jesness Inventory Classification System. Criminal Justice and
Behavior 15(1) (1988) 78–91
15. Warren, M.Q.: Classification of offenders as an aid to efficient management and
effective treatment. Journal of Criminal Law, Criminology, and Police Science 62
(1971) 239–258
16. Raine, A., Moffitt, T.E., Caspi, A.: Neurocognitive impairments in boys on the
life-course persistent antisocial path. Journal of Abnormal Psychology 114(1)
(2005) 38–49
17. Brennan, T., Dieterich, W.: Youth COMPAS Psychometrics: Reliability and Validity.
Northpointe Institute for Public Management, Traverse City, MI (2003)
18. Moffitt, T.E.: Adolescence-limited and life-course-persistent antisocial behavior:
A developmental taxonomy. Psychological Review 100(4) (1993) 674–701
19. Moffitt, T.E., Caspi, A., Rutter, M., Silva, P.A.: Sex Differences in Antisocial
Behaviour. Cambridge University Press, Cambridge, Mass. (2001)
20. Lykken, D.: The Antisocial Personalities. Lawrence Erlbaum, Hillsdale, NJ (1995)
21. Mealey, L.: The sociobiology of sociopathy: An integrated evolutionary model.
Behavioral and Brain Sciences 18(3) (1995) 523–599
22. Farrington, D.P.: Integrated Developmental and Life-Course Theories of Offending.
Transaction Publishers, London (2005)
23. Piquero, A.R., Moffitt, T.E.: Explaining the facts of crime: How the developmental
taxonomy replies to Farrington's invitation. In Farrington, D.P., ed.: Integrated
Developmental and Life-Course Theories of Offending. Transaction Publishers,
London (2005)
24. Osgood, D.W.: Making sense of crime and the life course. Annals of the AAPSS
602 (November 2005) 196–211
(Figure 1, panels (a) and (b): scatter plots of the toy data under the true labels, the Consistency Method (F*), a nearest neighbor labeling and K-Means. Panel (c): class profiles over the 32 classification inputs (FamCrime, SubTrbl, Impulsiv, ParentConf, ComDrug, Aggress, ViolTol, PhysAbuse, PoorSuper, Neglect, AttProbs, SchoolBeh, InconDiscp, EmotSupp, CrimAssoc, YouthRebel, LowSES, FamilyDisc, NegCognit, Manipulate, HardDrug, SocIsolate, CrimOpp, EmotBonds, LowEmpath, Nhood, LowRemor, Promiscty, SexAbuse, LowProsoc, AcadFail, LowGoals) for class labels 1 to 7 with n = 83, 103, 85, 151, 197, 146 and 130.)
Fig. 1. (a) Consistency Method: two labeled points per class (big stars) are used to
label the remaining unlabeled points with respect to the underlying cluster structure.
F* denotes the convergence of the series. (b) Toy example: three Gaussians with
hybrid cases in between them. Combining the labels assigned by K-Means (top, left)
and by two different runs of the Consistency Method (top, right; bottom, left) results
in the removal of most of the hybrid cases (blue dots; bottom, right) by requiring
consensus between all models built. The K-Means centers are marked in magenta.
(c) Resulting cluster means: mean plots of external criminal history measures across
classes from the core consensus solution with bootstrapped 95% confidence limits.
Onto Clustering of Criminal Careers
1 Introduction
The Dutch national police annually extracts information from digital narrative
reports stored throughout the individual departments. This data is compiled
into a large and reasonably clean database that contains all criminal records
from the last decade. This paper discusses a new tool that attempts to gain new
insights into the concept of criminal careers (the criminal activities that a single
individual exhibits) from this data.
The main contribution of this paper is in Section 4, where the criminal profiles
are established and a distance measure is introduced.
2 Background
Background information on clustering techniques in the law enforcement arena
can be found in [1, 4]. Our research aims to apply multi-dimensional cluster-
ing to criminal careers (rather than crimes or linking perpetrators) in order
to constitute a visual representation of classes of these criminals. A theoretical
background to criminal careers and the important factors can be found in [2].
3 Approach
We propose a criminal career analyzer, which is a multi-phase process visualized
in Figure 1. Our tool normalizes all careers to start at the same point
in time and assigns a profile to each offender. After this step, we compare all
possible pairs of offenders on their profiles and profile severity. We then employ
a specifically designed distance measure that incorporates crime frequency and
the change over time, and finally cluster the result into a two-dimensional image
using the method described in [3].
This research is part of the DALE project (Data Assistance for Law Enforcement)
as financed in the ToKeN program from the Netherlands Organization for Scientific
Research (NWO) under grant number 634.000.430
(Figure 1: overview of the multi-phase process: the national database of criminal records is mined for each offender's crime nature, frequency, seriousness and duration (factor extraction); a profile per offender per year is created; profile differences and the distance measure, including frequency, produce a distance matrix and a per-year distance graph, which are clustered for use by the police force.)
This intermediate distance matrix describes the profile difference per year for
each possible pair of offenders. Its values all range between 0 and 4.
The crime frequency, or number of crimes, will be divided into categories (0, 1,
2–5, 5–10, >10 crimes per year) to make sure that the absolute difference shares
the range 0–4 with the calculated profile difference, instead of the unbounded
number of crimes per year offenders can commit. The frequency value difference
will be denoted by FVD_xy.
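A sketch of this categorization follows; it is only illustrative, and the assignment of the boundary counts 5 and 10 to the lower category is our assumption, since the text does not specify it.

```python
def frequency_category(crimes_per_year):
    """Map a yearly crime count onto the categories 0, 1, 2-5, 5-10 and >10,
    encoded as 0..4 so that frequency differences share the 0-4 range.
    Boundary handling (5 and 10 in the lower category) is an assumption."""
    if crimes_per_year == 0:
        return 0
    if crimes_per_year == 1:
        return 1
    if crimes_per_year <= 5:
        return 2
    if crimes_per_year <= 10:
        return 3
    return 4
```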
Criminal careers of one-time offenders are obviously reasonably similar, although
their single crimes may differ largely in category or severity class. However,
when looking into the careers of career criminals, there are only minor
differences to be observed in crime frequency, and therefore the descriptive value
of the profile becomes more important. Consequently, the dependence of the profile
difference on the crime frequency must become apparent in our distance measure.
This ultimately results in a proposal for the crime difference per year
distance CDPY_xy between persons x and y:
CDPY_xy = ( (1/4) * PD_xy * FVD_xy + FVD_xy ) / 8 = PD_xy * FVD_xy / 32 + FVD_xy / 8    (2)
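In code, Equation (2) amounts to the following direct transcription; it assumes, as in the text, that both the profile difference PD_xy and the frequency value difference FVD_xy lie in the range 0 to 4, so that CDPY_xy is normalized to [0, 1].

```python
def cdpy(pd_xy, fvd_xy):
    """Crime difference per year (Eq. 2) for a pair of offenders x, y in a
    given year: both PD_xy and FVD_xy range over 0-4, so the result lies
    in [0, 1]."""
    return (pd_xy * fvd_xy / 4.0 + fvd_xy) / 8.0
```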
5 Experimental Results
Figure 2 gives an impression of the output produced by our tool when analyzing
the aforementioned database.
(Figure 2: two-dimensional clustering of criminal careers produced by the tool; annotated regions include severe career criminals and offenders with multiple crimes per offender, varying in the other categories.)
This image clearly shows that an identification can easily be coupled to the
emerging clusters after examination of their members. It appears to describe
reality very well. The large cloud in the left-middle of the image contains (most
of) the one-time offenders. This corresponds well with the database, since
approximately 75% of the people it contains have only one felony or misdemeanour
on their record. The other apparent clusters also represent clear subsets of
offenders.
References
1. R. Adderley and P. B. Musgrove. Data mining case study: Modeling the behavior
of offenders who commit serious sexual assaults. In KDD '01: Proceedings of the
Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining, pages 215–220, New York, 2001.
2. A. Blumstein, J. Cohen, J. A. Roth, and C. A. Visher. Criminal Careers and Career
Criminals. The National Academies Press, 1986.
3. J. Broekens, T. Cocx, and W. A. Kosters. Object-centered interactive multi-
dimensional scaling: Let's ask the expert. To be published in the Proceedings of the
18th BeNeLux Conference on Artificial Intelligence (BNAIC 2006), 2006.
4. T. K. Cocx and W. A. Kosters. A distance measure for determining similarity
between criminal investigations. In Proceedings of the Industrial Conference on Data
Mining 2006 (ICDM 2006), LNCS. Springer, 2006.
5. M. Williams and T. Munzner. Steerable, progressive multidimensional scaling. In
IEEE Symposium on Information Visualization (INFOVIS '04), pages 57–64. IEEE,
2004.
Sequential patterns extraction in multitemporal
satellite images
Abstract. The frequency and the quality of the images acquired by re-
mote sensing techniques are today so high that end-users can get high
volumes of observation data for the same geographic area. In this pa-
per, we propose to make use of sequential patterns to automatically
extract evolutions that are contained in a satellite image series, which
is considered to be a base of sequences. Experimental results using data
originating from the METEOSAT satellite are detailed.
1 Introduction
Data mining techniques that aim at extracting local patterns can be successfully
applied to process spatial data. For example, when considering geographical
information systems, one can find association rules such as "if a town is large
and is intersected by a highway, then this town is located near large surfaces of
water" [4]. When dealing with satellite images, it is also possible to extract
dependencies such as "if the visible reflectance intensity ranges from 192 to 255
for the green band and the infrared reflectance intensity ranges from 0 to 63,
then high yield is expected" [6]. In this paper, we propose an original
approach based on the use of sequential patterns [2] for analyzing multitemporal
remote sensing data [1, 3]. Indeed, as sequential patterns can include temporal
order, they can be used for extracting frequent evolutions at the pixel level, i.e.
frequent evolutions that are observed for geographical zones that are represented
by pixels. Section 2 gives a brief introduction to sequential pattern mining while
Section 3 details experiments on METEOSAT images.
build for each pixel a sequence of 10 values that are ordered w.r.t. temporal
dimension. This sequence of 10 values can be translated into a sequence of 10
symbols where each symbol is associated to a discretization interval. At the
image level, this means that we can get a set of millions of short sequences of
symbols, each sequence describing the evolution of a given pixel. This context
has been identified in data mining as a base of sequences [2]. More precisely, a
sequence is an ordered list of L events e_1, e_2, ..., e_L, which is denoted by
e_1 → e_2 → ... → e_L, and where each event is a non-empty set of symbols1. Let us
consider a toy example of a base of sequences, B = {A → K → J → B → C → C,
C → V → T → A → B → K, A → G → K → B → J → K, A → K → J → M → V → C}.
This base describes the evolution of 4 pixels throughout 6 images. For example,
the values of one pixel are at level A in the first image, level K in the second,
level J in the third, level B in the fourth, level C in the fifth and level C in the
sixth image. In this dataset, one can extract sequential patterns such as
A → B → K. If such a pattern occurs in a sequence describing the evolution of a
pixel, then the value of this pixel is A at a precise date, then this value changes
to B sometime later before further changing to K. In more detail, a sequential
pattern s_1 → s_2 → ... → s_n is itself a sequence, and it occurs in a sequence
e_1 → e_2 → ... → e_m if there exist integers 1 ≤ i_1 < i_2 < ... < i_n ≤ m such
that s_1 ⊆ e_{i_1}, s_2 ⊆ e_{i_2}, ..., s_n ⊆ e_{i_n}. A pattern is considered to be
frequent if it occurs in at least a fraction σ of the sequences, where σ is a
user-defined threshold, namely the minimum support. Back to our toy example, if
σ is set to 3/4, A → B turns out to be a frequent sequential pattern. To sum up, if
we consider an image series as a base of sequences where each sequence traces
the evolution of a given pixel, it is possible to find frequent evolutions at the
pixel level by extracting frequent sequential patterns. To do so, we can rely on
the various complete algorithms that have been designed to extract frequent
sequential patterns (e.g. [2, 5, 7]).
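The following sketch makes the containment and support definitions concrete on the toy base above. It is illustrative Python only, with one symbol per event; an actual extraction would rely on a complete algorithm such as cSPADE rather than this naive support check.

```python
def occurs_in(pattern, sequence):
    """True if a sequential pattern (list of events, each a set of symbols)
    occurs in a sequence, following the subset-embedding definition."""
    i = 0
    for event in sequence:
        if i < len(pattern) and pattern[i] <= event:  # pattern event is a subset
            i += 1
    return i == len(pattern)

def support(pattern, base):
    """Fraction of sequences in the base that contain the pattern."""
    return sum(occurs_in(pattern, s) for s in base) / len(base)

# toy base from the text, one symbol per event
base = [[{'A'}, {'K'}, {'J'}, {'B'}, {'C'}, {'C'}],
        [{'C'}, {'V'}, {'T'}, {'A'}, {'B'}, {'K'}],
        [{'A'}, {'G'}, {'K'}, {'B'}, {'J'}, {'K'}],
        [{'A'}, {'K'}, {'J'}, {'M'}, {'V'}, {'C'}]]
print(support([{'A'}, {'B'}], base))   # 0.75, i.e. frequent for sigma = 3/4
```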
3 Experiments
We used M. J. Zaki's public prototype (http://www.cs.rpi.edu) that implements
the cSPADE algorithm [7] in C++. The images we processed are visible band
images (0.5–0.9 µm) originating from the European geostationary satellite
METEOSAT and are encoded in a 256 grey level format. They can be accessed
for free in a slightly degraded JPEG format at http://www.sat.dundee.ac.uk2.
We decided to use this data as the interpretation is quite straightforward when
dealing with low definition and visible band images. The aim was to test this
approach before further analyzing high definition images such as radar images
from European Remote Sensing (ERS) satellites. Regarding the METEOSAT data,
we selected images that all cover the same geographic zone (North Atlantic,
Europe, Mediterranean regions, North Africa) and that contain 2 262 500 pixels
(905x2500). The 8 images we selected were acquired on the 7th, 8th, 9th, 10th,
11th, 13th, 14th and 15th of April 2006 at 12.00 GMT. We chose this time to
get maximum exposure over the geographical zones covered by the images. We then
rediscretized the 256 grey levels into 4 intervals in order (1) to reduce the effects
due to the acquisition process and to the JPEG degradation, and (2) to facilitate
the interpretation of results. Symbols 0, 1, 2, 3 respectively relate to the intervals
[0, 50] (water or vegetation), ]50, 100] (soil or thin clouds), ]100, 200] (sand or
relatively thick clouds), and ]200, 255] (thick clouds, bright sand or snow). The
first patterns appear at σ = 77.5% (execution time3 = 2 s). It is worth noting
that the number of extracted patterns does not exceed 110 until σ = 10%
(execution time = 54 s), which facilitates the interpretation of results. The first
phenomenon to be reported
1
We refer the reader to [2] for more generic and formal definitions.
2
Website of the NERC satellite images receiving station of Dundee University.
The last discovered phenomenon shows that some pixels did not change over the
image series. For example, we found 0 → 0 → 0 → 0 → 0 → 0 → 0 → 0
at σ = 1.4%, which means that some ocean zones were not covered by clouds.
3
All experiments were run on an AMD Athlon(tm) 64 3000+ (1800 MHz) platform with
512 MB of RAM under the SUSE Linux 10.0 operating system (kernel 2.6.13-15-default).
Another interesting pattern is 3 → 3 → 3 → 3 → 3 → 3 → 3 → 3 at σ = 0.7%.
As Figure 1 shows, this pattern is located in the Alps (snowy zones, upper
part of the image) and in North Africa (bright sand zones, lower part of the
image). We presented here 24 patterns out of the 110 whose minimum support is
greater than or equal to 10%. Thus, about 1/4 of the extracted patterns can be
considered of interest. Results can be refined by adding an infrared band. As an
example, other experiments show that it permits distinguishing between snowy
zones and bright sand zones. A final interesting result is that when localizing
frequent sequential patterns, coherent spatial zones appear (e.g. maritime zones,
snowy zones). We have begun to obtain similar results on radar images from the
ERS satellite covering the Alps by exhibiting geographical features such as glaciers.
4 Conclusion
We propose to consider a satellite image series as a base of sequences in which
evolutions can be traced thanks to sequential patterns. First experiments confirm
the potential of this approach by exhibiting well-known phenomena. Future work
includes refined preprocessing, such as scaling up from the pixel level to the
region level by using both domain knowledge and signal processing.
References
1. IEEE Transactions on Geoscience and Remote Sensing: special issue on analysis of
multitemporal remote sensing images, November 2003. Volume 41, Number 11,
ISSN 0196-2892.
2. R. Agrawal and R. Srikant. Mining sequential patterns. In P. S. Yu and A. L. P.
Chen, editors, Proc. of the 11th International Conference on Data Engineering
(ICDE'95), pages 3–14, Taipei, Taiwan, 1995. IEEE Computer Society Press.
3. F. Bujor, E. Trouve, L. Valet, J.-M. Nicolas, and J.-P. Rudant. Application of log-
cumulants to the detection of spatiotemporal discontinuities in multitemporal SAR
images. IEEE Transactions on Geoscience and Remote Sensing, 42(10):2073–2084,
2004.
4. K. Koperski and J. Han. Discovery of spatial association rules in geographic
information databases. In M. J. Egenhofer and J. R. Herring, editors, Proc. 4th Int.
Symp. Advances in Spatial Databases, SSD, volume 951, pages 47–66. Springer-
Verlag, 1995.
5. J. Pei, J. Han, B. Mortazavi-Asl, and H. Pinto. PrefixSpan: Mining sequential
patterns efficiently by prefix-projected pattern growth. In Proc. of the 17th
International Conference on Data Engineering (ICDE'01), pages 215–226, 2001.
6. W. Perrizo, Q. Ding, Q. Ding, and A. Roy. On mining satellite and other remotely
sensed images. In Proceedings of the ACM SIGMOD Workshop on Research Issues in
Data Mining and Knowledge Discovery, pages 33–40, Santa Barbara, California,
May 2001.
7. M. Zaki. Sequence mining in categorical domains: incorporating constraints. In Proc.
of the 9th International Conference on Information and Knowledge Management
(CIKM'00), pages 422–429, Washington, DC, USA, November 2000.
Carancho - A Decision Support System for
Customs
1 Introduction
Foreign commerce has historically been of great importance as an economic and
political instrument worldwide. Tax policy, among other instruments, is one of the
forms by which a government controls the trade of goods and services. The problem
is that, as one might expect, whenever there is someone charging taxes, there is
also someone else trying to avoid paying them.
Unfortunately, this is not simply a matter of tax evasion. According to
the U.S. Department of State [1], practices such as commodity overvaluation
allow corporations to perform fund transfers transparently, possibly concealing
money laundering schemes. These activities can be directly connected to drug
trafficking and smuggling. In this light, an apparently minor tax-evasion offence
could be just the tip of the iceberg of more severe crimes.
Some of the common problems found in customs transactions are:
Over/Undervaluation: As mentioned before, incorrect price estimation can
conceal illicit transfers of funds: money laundering schemes in the case of
overvaluation, and tax evasion offences in the case of undervaluation.
Classification errors: When assigning a product to one of the predefined
categories, the importer can make an honest mistake and misclassify it.
Nevertheless, such a mistake could also conceal an instance of tax evasion, should
the misclassification lead to lower tax charges.
Origin errors: Such errors happen when an importer incorrectly declares the
goods' country of origin. This category is a direct effect of the special customs
restrictions regarding particular combinations of goods and their origin.
Smuggling: Sometimes importers smuggle different materials into the country
amongst the goods they are actually importing. This problem, as reported in
[1], could also conceal money laundering schemes, especially when the smuggled
material is a high-valued commodity, such as gold.
In this paper, we describe Carancho, a graphical decision support system
designed to help customs officers decide what should be inspected, taking into
account all the past international operations.
2.1 Graphical-Assisted Manual Selection
The manual selection approach relies on the expertise and experience of the customs
officer to determine what could be considered a suspicious import. Within
this framework, the officer is responsible for deciding about the inspection of
some goods based on a graphical representation of all operations that are similar
to the one under evaluation.
Inspectors can define the level of similarity on the fly, applying filters to the
database of transactions. The program retrieves the filtered relevant information,
groups the data according to a set of predefined dimensions, and plots their
histograms. This is a poor man's attempt to display the transactions' distribution
along these dimensions. The inspector can then, based on this overview of
all the similar transactions, decide whether to inspect the cargo more carefully
or to clear it right away.
The batch search is appropriate for the inspection of operations that have already
been cleared but that may deserve later investigation of documentation and
financial statements. Basically, the system processes the whole database, looking
for outliers. This search is an attempt to automate the manual approach using
standard statistical estimation techniques.
Outlier detection takes place as follows. The system searches the whole
database, grouping import documents according to their product code4. It then
analyses each group along the aforementioned dimensions.
For each of the strategic dimensions, the system finds its robust mean and
standard deviation. Then, it looks for transactions that are more than three
standard deviations away from the mean in any of these dimensions. That is, it
defines a dimension outlier as any import operation whose declared value, for a
specific dimension d, lies more than three standard deviations away from the
robust mean of that dimension.
The results of outlier detection are sent to a file for later analysis.
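The following sketch shows one way such a batch search could be implemented. The exact robust estimators used by the system are not specified in the text, so the median and the MAD are used here as stand-ins for the robust mean and standard deviation, and the column names are hypothetical.

```python
import pandas as pd

def flag_dimension_outliers(df, dims, group_col="product_code", k=3.0):
    """Within each product-code group, flag operations lying more than k
    (robust) standard deviations from a robust center in any dimension.
    Median/MAD are stand-ins for the unspecified robust estimators."""
    flags = pd.Series(False, index=df.index)
    for _, group in df.groupby(group_col):
        for d in dims:
            x = group[d].astype(float)
            center = x.median()                            # robust location
            scale = 1.4826 * (x - center).abs().median()   # MAD as sigma estimate
            if scale == 0:
                continue
            flags.loc[x.index] |= (x - center).abs() > k * scale
    return df[flags]

# hypothetical usage:
# outliers = flag_dimension_outliers(imports, dims=["declared_value", "weight"])
```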
3 Results
We ran the batch mode of the system on a database of 1.5 million transactions.
The outlier detection rate was around 0.9%; that is, of the total number of
import operations, around 14,000 were considered outliers.
4
Similar to the Harmonised System, developed by the World Customs Organisation.
This is a very interesting result, for it allows the customs department to pay
special attention to these strange operations, as opposed to deciding which
ones should be cleared and which should be inspected on a more subjective
basis. This saves time and effort, not to mention financial resources.
Naturally, we have to bear in mind that being an outlier does not imply being
an outlaw. Some outlier operations are in fact proper ones; they simply do not
fit the pattern followed by the majority of similar operations. Nevertheless, this
system will increase the customs officers' efficiency, making the decision process
swifter and speeding up the customs process.
We are still working on assessing the precision of our system. We chose not
to use recall as an evaluation measure because there is no way of telling which
irregular operations escaped detection.
4 Conclusion
In this paper we presented Carancho, a decision support system for the customs
department aimed at detecting some types of fraud.
We proposed a new approach to the problem, organized along two fronts. The first
front is intended to help customs officers make their decisions more quickly,
by showing them a graphical representation of the distribution of similar import
operations in the past.
The second front is intended to detect those operations that are very unusual
but, for some reason, were cleared from physical inspection. In sum, the first
approach is meant to detect outliers as they happen, on the fly, whereas the
second one is meant to detect past outliers.
Another novelty of our model is the fact that we broke up a multi-dimensional
vector space (the dimensions analysed by the system) into a set of one-dimensional
spaces, so that we could look for outliers in each of them separately, and then
combine the results. Despite the fact that this measure simplified our task, we
still need to determine how good a decision it was.
References
1. U.S. Department of State: International Narcotics Control Strategy Report. Volume
II: Money Laundering and Financial Crimes (2005)
2. Bolton, R.J., Hand, D.J.: Statistical fraud detection: A review. Statistical Science
17(3) (2002) 235–255
3. Singh, A.K., Sahu, R.: Decision support system for HS classification of commodities.
In: Proceedings of the 2004 IFIP International Conference on Decision Support
Systems (DSS 2004) (2004)
Bridging the gap between commercial and
open-source data mining tools: a case study
1 Introduction
In the last decade the information overload caused by the massive influx of raw
data has made traditional data analysis insufficient. This has resulted in the new
interdisciplinary field of data mining, which encompasses both classical statistical
and modern machine learning tools to support data analysis and knowledge
discovery from databases. Every year more and more software companies include
elements of data mining in their products. Most notably, almost all major
database vendors now offer data mining as an optional module within their
business intelligence offering; data mining is therefore integrated into the product.
While such an integration provides several advantages over the client-server
data mining model, there are also some disadvantages. Among them, the most
important is the complete dependence on the data mining solution as provided by
the vendor. On the other hand, most free, open-source data mining tools offer
unprecedented flexibility and are easily extendable.
We propose a server-side approach that has all the advantages (security and
integration) of the commercial data mining solutions while retaining the flexibility
of open-source tools. We illustrate our solution with a case study of coupling the
open-source data mining suite Weka [5] into the Oracle Database Server [2].
2 Methodology
There exist several different possibilities for coupling data mining with database
systems [3]. Of those, experiments show that the Cache-Mine approach [3] (data
read from SQL tables is cached; data mining algorithms are written in SQL as
stored procedures) is preferable in most cases.
Since Oracle Database Server includes a server-side Java virtual machine,
extending it with Weka is relatively straightforward, once one gets through a
sufficient amount of user manuals. Our approach is rather similar to Cache-Mine,
with data being cached in RAM due to Weka's requirements, and stored
procedures being just interfaces to Java classes. Fig. 1 shows the architecture
of the extended system. Compiled Java classes or whole JAR archives
are uploaded to the server using Oracle-provided tools (loadjava). On the server,
appropriate methods are published (in Oracle terminology) and thus made
accessible from PL/SQL as stored procedures [1]. Naturally, all Weka classes are
accessible from server-side Java programs. Connections to the database are made
through JDBC using Oracle's server-side shortcut names. By using the shortcut
names, database connections remain local and data does not leave the server at
any time.
(Fig. 1: architecture of the extended system, showing the client and the server-side Java application within the database.)
3 Experimental evaluation
Table 1. A comparison of prediction performance between Weka and Oracle Data
Mining on five domains.
naive Bayes) in Weka often (although not always) perform slightly better than
their Oracle equivalents (Tab. 1).
On the other hand, comparing client-side and server-side Weka shows that Oracle's
server-side Java virtual machine is about ten times slower than Sun's client-side
implementation when compared on the same computer. Oracle is aware of
this problem and claims that its virtual machine is intended for fast transaction
processing and not for heavy computation. For computationally intensive Java
applications it offers ahead-of-time (native) compilation of Java classes and
JARs. With this approach Java bytecode is translated to C source and further
compiled into dynamic link libraries (DLLs) that can be accessed from Java
through the Java Native Interface. Compiled bytecode can be up to ten times
faster than interpreted bytecode. Fig. 2 shows model building times for different
configurations (Weka on the client and the server, and natively compiled Weka
on the server, labelled ncomp).
We also compared model building times between Weka and Oracle Data
Mining (Fig. 3). For problem sizes of up to 100,000 rows (in this case about
10 megabytes), the Weka algorithms are quicker. For larger problems, Oracle
Data Mining is faster, since it is specialized to work with large collections
of data.
(Plot: model build time in seconds against the number of rows (examples), from 1000 to 10000, for Weka on the server, Weka on the client, and natively compiled Weka on the server (ncomp).)
Fig. 2. Building decision tree models with Weka on large KDD datasets.
(Fig. 3: model build time in seconds, on a logarithmic scale, against the number of examples (in thousands, 50 to 550) for Weka on the server, Weka on the client, and Oracle Data Mining (ODM).)
4 Conclusion
In conclusion, Weka can be used quite successfully within the Oracle Database
Server. It is accessible from both Java and PL/SQL applications. Built models
can be stored in a serialized form and later reused. Since data does not leave
the server, this approach is much more secure than the usual client-server
approach. One of our business customers (a large telecommunications company),
who absolutely refused to use a client-server approach and insisted exclusively on
server-side data mining, now claims that they would be willing to use such a
server-side data mining extension.
References
[1] Oracle Database Java Developer's Guide, 10g Release 1 (10.1). Oracle, 2004.
[2] Oracle Data Mining Application Developer's Guide. Oracle, 2005.
[3] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with
relational database systems: alternatives and implications. In Proc. 1998 ACM-
SIGMOD, pages 343–354, 1998.
[4] M. Taft, R. Krishnan, M. Hornick, D. Muhkin, G. Tang, S. Thomas, and P. Stengard.
Oracle Data Mining Concepts, 10g Release 2 (10.2). 2005.
[5] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and
Techniques. Morgan Kaufmann, 2nd edition, 2005.
Author Index