Data Mining for Business Applications
Frontiers in Artificial Intelligence and
Applications
FAIA covers all aspects of theoretical and applied artificial intelligence research in the form of
monographs, doctoral dissertations, textbooks, handbooks and proceedings volumes. The FAIA
series contains several sub-series, including “Information Modelling and Knowledge Bases” and
“Knowledge-Based Intelligent Engineering Systems”. It also includes the biennial ECAI, the
European Conference on Artificial Intelligence, proceedings volumes, and other ECCAI – the
European Coordinating Committee on Artificial Intelligence – sponsored publications. An
editorial panel of internationally well-known scholars is appointed to provide a high quality
selection.
Series Editors:
J. Breuker, N. Guarino, J.N. Kok, J. Liu, R. López de Mántaras,
R. Mizoguchi, M. Musen, S.K. Pal and N. Zhong
Volume 218
Recently published in this series
Vol. 217. H. Fujita (Ed.), New Trends in Software Methodologies, Tools and Techniques –
Proceedings of the 9th SoMeT_10
Vol. 216. P. Baroni, F. Cerutti, M. Giacomin and G.R. Simari (Eds.), Computational Models of
Argument – Proceedings of COMMA 2010
Vol. 215. H. Coelho, R. Studer and M. Wooldridge (Eds.), ECAI 2010 – 19th European
Conference on Artificial Intelligence
Vol. 214. I.-O. Stathopoulou and G.A. Tsihrintzis, Visual Affect Recognition
Vol. 213. L. Obrst, T. Janssen and W. Ceusters (Eds.), Ontologies and Semantic Technologies
for Intelligence
Vol. 212. A. Respício et al. (Eds.), Bridging the Socio-Technical Gap in Decision Support
Systems – Challenges for the Next Decade
Vol. 211. J.I. da Silva Filho, G. Lambert-Torres and J.M. Abe, Uncertainty Treatment Using
Paraconsistent Logic – Introducing Paraconsistent Artificial Neural Networks
Vol. 210. O. Kutz et al. (Eds.), Modular Ontologies – Proceedings of the Fourth International
Workshop (WoMO 2010)
Vol. 209. A. Galton and R. Mizoguchi (Eds.), Formal Ontology in Information Systems –
Proceedings of the Sixth International Conference (FOIS 2010)
Vol. 208. G.L. Pozzato, Conditional and Preferential Logics: Proof Methods and Theorem
Proving
Vol. 207. A. Bifet, Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data
Streams
Vol. 206. T. Welzer Družovec et al. (Eds.), Information Modelling and Knowledge Bases XXI
Data Mining for Business
Applications
Edited by
Carlos Soares
LIAAD-INESC Porto L.A./Faculdade de Economia, Universidade do Porto,
Portugal
and
Rayid Ghani
Accenture Technology Labs, U.S.A.
© 2010 The authors and IOS Press.
All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted, in any form or by any means, without prior written permission from the publisher.
Publisher
IOS Press BV
Nieuwe Hemweg 6B
1013 BG Amsterdam
Netherlands
fax: +31 20 687 0019
e-mail: [email protected]
LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.
Data Mining for Business Applications v
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
Preface
The field of data mining is currently experiencing a very dynamic period. It has reached
a level of maturity that has enabled it to be incorporated in IT systems and business
processes of companies across a wide range of industries. Information technology and
E-commerce companies such as Amazon, Google, Yahoo, Microsoft, IBM, HP and Ac-
centure are naturally at the forefront of these developments. Data mining
technologies are also becoming well established in other industries and government sectors,
such as health, retail, automotive, finance, telecom and insurance, in large corpo-
rations such as Siemens, Daimler, Walmart, Washington Mutual, Progressive Insurance
and Portugal Telecom, as well as in governments across the world.
As data mining becomes a mainstream technology in businesses, data mining re-
search has been experiencing explosive growth. In addition to well established applica-
tion areas such as targeted marketing, customer churn, and market basket analysis, we
are witnessing a wide range of new application areas, such as social media, social net-
works, and sensor networks. In addition, more traditional industries and business pro-
cesses, such as healthcare, manufacturing, customer relationship management and mar-
keting are also applying data mining technologies in new and interesting ways. These
areas pose new challenges both in terms of the nature of the data available (e.g., complex
and dynamic data structures) as well as in terms of the underlying supporting technology
(e.g., low-resource devices). These challenges can sometimes be tackled by adapting ex-
isting algorithms but at other times need new classes of techniques. This can be observed
by looking at the topics being covered at existing major data mining conferences and
journals as well as by the introduction of new ones.
A major reason behind the success of the data mining field has been the healthy
relationship between the research and the business worlds. This relationship is strong
in many companies where researchers and domain experts collaborate to solve practical
business problems. Many of the companies that integrate data mining into their products
and business processes also employ some of the best researchers and practitioners in the
field. Some of the most successful recent data mining companies have also been started
by distinguished researchers. Even researchers in universities are getting more connected
with businesses and are getting exposed to business problems and real data. Often, new
breakthroughs in data mining research have been motivated by the needs and constraints
of practical business problems. This can be observed at data mining scientific confer-
ences, where companies are participating very actively and there is a lot of interaction
between academia and industry.
As part of our (small) contribution to strengthening the collaboration between compa-
nies and universities in data mining, we have been helping organize a series of workshops
on Data Mining for Business Applications, held together with major conferences in the field:
• “Data Mining for Business” workshop, with ECML/PKDD, organized by Carlos
Soares, Luís Moniz (SAS Portugal) and Catarina Duarte (SAS Portugal), which
was held in Porto, Portugal, in 2005 (http://www.liaad.up.pt/dmbiz/).
The chapters in Part 2 present applications of data mining in some of the most
important industries, namely banking, government, energy and healthcare. The issues
addressed in these papers include important aspects such as how to incorporate domain-
specific knowledge in the development of data mining systems and the integration of
data mining technology in larger systems that aim to support core business processes.
The applications in this book clearly show that data mining projects must not be regarded
as independent efforts. They need to be integrated into larger systems to align with the
goals of the organization and those of its customers and partners. Additionally, the out-
put of data mining components must, in most cases, be integrated into the IT systems
of the business and, therefore, into its (decision-making) processes, sometimes as part of
decision-support systems (DSS).
The chapters in Part 3 are devoted to emerging applications of data mining. These
chapters discuss the application of novel methods that deal with complex data like social
networks and spatial data, to explore new opportunities in domains such as criminology
and marketing intelligence. These chapters illustrate some of the exciting developments
going on in the field and identify some of the most challenging opportunities. They stress
the need for researchers to keep up with emerging business problems, identify potential
applications and develop suitable solutions. They also show that companies must not only
pay attention to the latest developments in research but also continuously challenge the
research community with new problems. We believe that the flow of new and interesting
applications will continue for many years and drive the research community to come up
with exciting and useful data mining methods.
This book presents a collection of contributions that illustrates the importance of
maintaining close contact between data mining researchers and practitioners. For re-
searchers, it is essential to be exposed to and motivated by real problems and understand
how business problems not only provide interesting challenges but also practical con-
straints which must be taken into account in order for their work to have high practical
impact. For practitioners, it is not only important to be aware of the latest technology
developments in data mining, but also to have continuous interactions with the research
community to identify new opportunities to apply existing technologies and also provide
the motivation to develop new ones.
We believe that this book will be interesting not only for data mining researchers
and practitioners who are looking for new research and business opportunities in DM, but
also for students who wish to get a better understanding of the practical issues involved
in building data mining systems and find further research directions. We hope that our
readers will find this book useful.
Contents
Preface
Carlos Soares and Rayid Ghani
Members of the Program Committees of the DMBiz Workshops
Data Mining for Business Applications 1
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-1
Data Mining for Business Applications: Introduction
Carlos Soares and Rayid Ghani
Abstract. This chapter introduces the volume on Data Mining (DM) for Business
Applications. The chapters in this book provide an overview of some of the ma-
jor advances in the field, namely in terms of methodology and applications, both
traditional and emerging. In this introductory paper, we provide a context for the
rest of the book. The framework for discussing the contents of the book is the DM
methodology, which is suitable for organizing and relating the diverse contributions
of the selected chapters. The chapter closes with an overview of the chapters in the
book to guide the reader.
Preamble
E-mail: [email protected].
2 An overview of scientific and engineering applications is given in [1].
In addition to well established application areas such as targeted marketing,
customer churn, and market basket analysis, we are witnessing a wide range of new ap-
plication areas, such as social media, social networks, and sensor networks. In addition,
more traditional industries and business processes, such as health care, manufacturing,
customer relationship management and marketing are also applying data mining tech-
nologies in new and interesting ways. These areas pose new challenges both in terms
of the nature of the data available (e.g., complex and dynamic data structures) as well
as in terms of the underlying supporting technology (e.g., low-resource devices). These
challenges can sometimes be tackled by adapting existing algorithms but at other times
need new classes of techniques.
A major reason behind the success of the data mining field has been the healthy
relationship between the research and the business worlds. This relationship is strong
in many companies where researchers and domain experts collaborate to solve practical
business problems. On the one hand, business problems are driving new research (e.g.,
the Netflix prize3 and DM competitions such as the KDD CUP4). On the other hand, re-
search advances are finding applicability in real-world applications (e.g., support vector
machines in Computational Biology5). Many of the companies that integrate data min-
ing into their products and business processes also employ some of the best researchers
and practitioners in the field. Some of the most successful recent data mining companies
have also been started by distinguished researchers. Researchers in universities are get-
ting more connected with businesses and are getting exposed to business problems and
real data. Often, new breakthroughs in data mining research have been motivated by the
needs and constraints of practical business problems. Data Mining conferences, such as
KDD, ICDM, SDM, PKDD and PAKDD, play an important role in the interaction be-
tween researchers and practitioners. Companies are participating very actively in these
conferences, both by providing sponsorship as well as attendees.
This healthy relationship between academia and industry does not mean that there
are no issues left to be solved when building data mining solutions. From a purely tech-
nical perspective, plenty of algorithms, tools and knowledge are available to develop good
quality DM models. However, despite the amount of available information (e.g., books,
papers and web pages) about DM, some of the most practical aspects are not sufficiently
documented. These aspects include data preparation (e.g., cleaning and transformation),
the adaptation of existing methods to solve a new application, the combination of different types
of methods (e.g., clustering and classification), the incorporation of domain knowledge into
data mining systems, usability of data mining systems, ease of deployment, and testing
and integration of the DM solution with the Information System (IS) of the company.
Not only do these issues account for a large proportion of the effort spent in a DM project
but they often determine its success or failure [2].
A series of workshops has been organized to enable the presentation of work that
addresses some of these concerns.6 These workshops were organized together with some
of the most important DM conferences:
• “Data Mining for Business” workshop, with ECML/PKDD, organized by Carlos
Soares, Luís Moniz (SAS Portugal) and Catarina Duarte (SAS Portugal), which was held in Porto, Portugal, in 2005.
3 http://www.netflixprize.com
4 http://www.sigkdd.org/kddcup/index.php
5 http://www.support-vector.net/bioinformatics.html
6 http://www.liaad.up.pt/dmbiz
Methodologies, such as CRISP-DM [4], typically organize DM projects into the follow-
ing six steps (Figure 1): business understanding, data understanding, data preparation,
modeling, evaluation and deployment. In the following, we briefly present how the chap-
ters in this book address relevant issues for each of those steps.

Figure 1. The Data Mining Process, according to the CRISP-DM methodology (image obtained from http://www.crisp-dm.org)
In the business understanding step, the goal is to clarify the business objectives for the
project. The second step, data understanding, consists of identifying sources, collecting
and becoming familiar with the data available for the project.
A very important issue is the scope of the project. It is necessary to identify a busi-
ness problem rather than a DM problem and develop a solution which combines DM ap-
proaches with others, where and whenever necessary. Some of the chapters in this book
illustrate this concern quite well. Rodrigues et al. address the problem of recommending
the most suitable tariff for electricity consumers [5]. The system proposed combines DM
with more traditional decision support systems (DSS) technologies, such as a knowledge
base, a database and an inference engine. In another chapter, Ceglar et al. describe a tool
to support patient costing in health units [6]. The authors wrap several DM models in a
tool for users who are not DM experts. In the corporate radar proposed by Yeh and Kass,
the goal is to automatically monitor the web and select news stories that are relevant to a
business [7]. Technologies from diverse areas are combined, including text mining, nat-
ural language processing (NLP), semantic models, inference engines and web sensors.
The tool offers a set of dashboards to present the selected information to the business
users.
To view the project in context, it is necessary to identify the stakeholders. This is
discussed by Puuronen and Pechenizkiy in the context of DM research [8]. The needs
of those stakeholders, in particular the end users, must be understood. Several chapters
discuss this issue in different contexts, including Blumenstock et al. for the automobile
industry [9], Domingos and van de Merckt for the financial services industry [10] and
Bruckhaus and Guthrie for the semiconductor industry [11].
It is also essential to clearly define the goals of the project. This should be done in
terms of the business as well as of the data mining methods. Domingos and van de Mer-
ckt establish the profitability of customers as their business goal, while a typical accuracy
measure is used to assess the algorithms [10]. Whenever possible, goals should be quan-
tified. In the error detection application described by Torgo and Soares, the customer de-
fined thresholds for maximum effort and minimum results, which should be respected in
order for the project to be considered successful [12]. Sometimes it is necessary for the
project team to clarify the definition of concepts which could, at first sight, be considered
trivial. This is the case of the concept of churn in the churn prediction task tackled by
Mutanen et al. for a retail bank [13].
In some cases, there are constraints associated with the process that affect the
DM effort and, thus, should be identified as soon as possible. One example is the limit on
the amount of resources available for follow-up activities in the application described by
Torgo and Soares, which changes over time [12].
Understanding the data and their sources is an increasingly important step due to the
growing volume, diversity and complexity of data. In their chapter, Jank and Shmueli
propose a system to process the huge amounts of complex data from online eBay auctions
[14]. Another example is the system proposed by Yeh and Kass that processes
news from diverse sources on the web, which are then combined with semantic models
of the application [7]. The chapter by Körner et al. describes several applications that
combine spatial, time and socio-demographic data, which is quickly becoming a common
scenario [15].
Finally, an issue that is very rarely addressed is the assessment of the costs of the
DM activity. A few examples, such as the cost of gathering data and the costs of the
errors, are discussed by Puuronen and Pechenizkiy [8].
Data preparation consists of a diverse set of operations to clean and transform the data
so as to prepare it for the following modeling step [16].
The chapters by Blumenstock et al. and Rodrigues et al. illustrate some of the most
common kinds of problems that data may contain: imbalanced classes, inconsistent data,
outliers and missing values [9,5]. Some problems can be addressed with generic methods,
which do not depend on the application (e.g., replacing missing values with the mean
value of the attribute). In other cases, the correction uses domain knowledge (e.g., replac-
ing missing values with a special value). Blumenstock et al. find that simple operations
such as discretization can be very important to produce models for users who are not
data mining experts.
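As a minimal, generic illustration of such operations (not taken from any of the chapters; the data set and column names are made up), the following Python sketch imputes missing values with the column mean, alternatively flags them with a domain-specific special value, and discretizes a continuous attribute into bands that are easier for non-experts to read:

import numpy as np
import pandas as pd

# Hypothetical data set with missing values in two numeric attributes.
df = pd.DataFrame({
    "mileage": [12000, np.nan, 54000, 87000, np.nan, 23000],
    "repair_cost": [250.0, 1200.0, np.nan, 90.0, 450.0, 3100.0],
})

# Generic, application-independent correction: replace missing values with the column mean.
df["mileage_imputed"] = df["mileage"].fillna(df["mileage"].mean())

# Domain-driven correction: a special value lets downstream models treat
# "missing" as its own category rather than as an ordinary number.
df["repair_cost_imputed"] = df["repair_cost"].fillna(-1)

# Simple discretization into equal-frequency bands, which often makes the
# resulting models easier to read for users who are not data mining experts.
df["mileage_band"] = pd.qcut(df["mileage_imputed"], q=3, labels=["low", "medium", "high"])

print(df)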
1.3. Modeling
In the modeling step, the data resulting from the application of the previous steps is
analyzed to extract the knowledge that will be used to address the business problem.
In some applications, domain-dependent knowledge is integrated into the DM pro-
cess in all steps except this one, in which off-the-shelf methods/tools are applied. In this
volume, Mutanen et al. used logistic regression to identify bank customers that are likely
to churn [13]. Besides often being reported to obtain good results, this algo-
rithm is also widely used because it generates models that can be understood by many end
users.
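As a small, self-contained sketch of this kind of off-the-shelf modeling (synthetic data and hypothetical features, not the study in [13]), a churn model can be fitted and inspected with scikit-learn as follows; the coefficients are what makes the model explainable to end users:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for customer data: three behavioural features per customer.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 2] + 0.5 * X[:, 0] + rng.normal(scale=0.5, size=1000) > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The coefficients can be read as log-odds contributions of each feature,
# which is one reason the model is easy to explain to business end users.
print("coefficients:", model.coef_)
print("churn probabilities for five test customers:", model.predict_proba(X_test)[:5, 1])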
Sometimes, the results obtained with a single method are not satisfactory and better
solutions can be obtained with a combination of multiple methods. For instance, the sys-
tem proposed by Rodrigues et al. for electricity tariff recommendation includes cluster-
ing and classification modules [5]. In both of them, they use common algorithms, namely
k-means and Self-Organizing Maps (SOM) for clustering and decision trees for classifi-
cation. Another example is the system described by Domingos and van de Merckt, which
combines sequences of methods (including data preparation and modeling methods) to
develop a large number of models [10]. The methods are selected based on best practices
according to the experience of the authors. The issue of dealing with a very large num-
ber of models is becoming increasingly popular in DM, leading to what has been called
extreme data mining [17].
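The following Python sketch illustrates, in a very simplified form, what such a combination of a clustering and a classification module can look like (it is not the system of Rodrigues et al. [5]; the consumption profiles are synthetic): k-means discovers consumption profiles and a decision tree then assigns consumers to them in a way that can be inspected by domain experts.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic daily consumption profiles: 200 consumers x 24 hourly readings,
# half of them flat and half with an evening peak.
rng = np.random.default_rng(1)
evening_peak = np.r_[np.ones(18), 3 * np.ones(6)]
profiles = np.vstack([
    rng.normal(loc=1.0, scale=0.2, size=(100, 24)),
    rng.normal(loc=evening_peak, scale=0.2, size=(100, 24)),
])

# Clustering module: discover consumption profiles.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)

# Classification module: a decision tree mapping profiles to clusters; its rules
# show which hours of the day drive the assignment.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(profiles, kmeans.labels_)
print(export_text(tree, feature_names=[f"hour_{h}" for h in range(24)]))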
A different modeling approach consists of developing/adapting general methods for
a specific application, taking into account its peculiarities. In some cases, the applica-
tion is not really new but has specific characteristics that require existing methods to be
adapted. In the chapter by Torgo and Soares, a hierarchical clustering algorithm is used
[12]. The use of clustering algorithms for outlier detection is not new. However, due to
the nature of the application, the algorithm was changed such that a ranking of the obser-
vations is generated, rather than a simple selection of the ones which are potential errors.
This makes it possible for the domain expert to use the output of the method in different
ways depending on the available resources.
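A minimal sketch of this general idea (not the authors' actual method [12]; the data are synthetic) is to build a hierarchical clustering, cut it into many small groups, and rank observations so that members of the smallest, most isolated groups come first for inspection:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic transaction data with a few injected anomalous records.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0], [-7.0, 9.0], [9.0, -6.0]]])

Z = linkage(X, method="ward")

# Cut the dendrogram into many clusters; observations that end up in tiny
# clusters are the most isolated and therefore the most suspicious.
labels = fcluster(Z, t=20, criterion="maxclust")
cluster_sizes = np.bincount(labels)[labels]

# Produce a ranking by increasing cluster size instead of a yes/no flag, so the
# expert can inspect as many cases as the available resources allow.
ranking = np.argsort(cluster_sizes)
print("top 5 candidates for inspection:", ranking[:5])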
Some applications involve novel tasks that require the development of new methods,
sometimes incorporating important amounts of domain knowledge. Some chapters in
this book describe new methods motivated by the underlying application, for the health
industry [6], criminology [18] and price prediction in eBay auctions [14]. In the chapter
by Körner et al. the methods are customized to deal with spatial data [15]. The com-
plexity of the data is such that the methods described by the authors incorporate prepro-
cessing mechanisms together with the model building ones. In the applications for the
automotive industry described by Blumenstock et al., the requirements for interactivity
are so strong, that new algorithms are proposed that can incorporate decisions made by
the users during the building of the models, which are called interactive algorithms [9].
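The toy Python sketch below conveys the general flavour of such interactivity (it is not one of the algorithms proposed in [9]; the data set and attribute names are invented): the system scores every candidate split attribute, but the decision of which attribute to split on is left to the domain expert.

import numpy as np
import pandas as pd

def gini(labels):
    # Gini impurity of an array of class labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def interactive_split(df, target):
    # Score every candidate attribute, then let the domain expert choose the split.
    scores = {}
    for col in df.columns.drop(target):
        scores[col] = sum(
            len(group) / len(df) * gini(group[target])
            for _, group in df.groupby(col)
        )
    for col, score in sorted(scores.items(), key=lambda kv: kv[1]):
        print(f"{col}: weighted impurity after split = {score:.3f}")
    return input("Attribute to split on: ")

# Invented warranty-style data with two categorical attributes and a claim flag.
data = pd.DataFrame({
    "production_plant": ["A", "A", "B", "B", "B", "A"],
    "engine_type": ["x", "y", "x", "y", "x", "y"],
    "claim": [1, 1, 0, 0, 0, 1],
})
chosen = interactive_split(data, target="claim")
print("expert chose:", chosen)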
A data analyst must also be prepared to use methods for different DM tasks and orig-
inating from different fields, as they may be necessary in different applications, some-
times combined as previously described. The applications described in this book illus-
trate this quite well, including some tasks which are not so common in DM applications.
They include clustering (e.g., [5,6,18]), classification (e.g., [9,10,13,5]), regression (e.g.,
[15]), quantile regression (e.g., [10]), outlier detection (e.g., [12,6]), subgroup discov-
ery (e.g., [9,15]), time series analysis (e.g., [6,14]), visual analytics (e.g., [6,15]) and
information retrieval and extraction (e.g., [7]).
Additionally, as previously stated, to build complete solutions for business prob-
lems, it is often necessary to combine DM techniques with others. These can be tech-
niques from artificial intelligence or related fields. In the application by Rodrigues et al.
the DM models are integrated into a decision support system (DSS) that also incorpo-
rates a knowledge base, a database and an inference engine [5]. In the corporate radar
proposed by Yeh and Kass, very diverse technologies are used, including text mining,
natural language processing (NLP), semantic models, inference engines and web sensors
[7]. In some cases, the solution may combine DM with more traditional techniques. In
the targeting application for the financial services industry described by Domingos and
van de Merckt, a manual segmentation of customers is carried out before applying the
DM methods [10]. In summary, the wider the range of tools that is mastered by a data
analyst (or the team working on the project), the better the results that can be obtained.
Some of the papers in this volume also discuss the importance of tools. Domingos
and van de Merckt observe that most of the DM tools available on the market are work-
benches that end up being too flexible [10]. This leaves the developer with many tech-
nical decisions to be made during the modeling phase when the focus should be on the
business issues. The tool developed by Ceglar et al. for patient costing addresses this
issue [6]. On the one hand, it is tuned for that particular application. On the other hand,
it leaves some room for the user to explore different methods. Blumenstock et al. pro-
vide a different perspective, arguing that the tools should be interactive during the model
building phase, to enable the domain expert to contribute his or her knowledge during
the process [9].
1.4. Evaluation
The goal of the evaluation step is to assess the adequacy of the knowledge obtained
according to the project objectives.
For a DM project to be successful, the criteria selected to evaluate the knowledge
obtained in the modeling phase must be aligned with the business goals. In some cases,
it is possible to find measures that the experts can relate to. A few examples can be found
in this book, with lift [9,13] and recall [9]. Torgo and Soares present an unusual case,
where the experts established goals in terms of two measures that are common in DM:
recall and percentage of selected transactions [12].
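As a simple, self-contained illustration of these two measures (the numbers are synthetic and not taken from any chapter), the following sketch scores transactions, selects a fixed fraction for inspection, and reports the recall and lift achieved by that selection:

import numpy as np

rng = np.random.default_rng(3)

# Synthetic ground truth (1 = erroneous transaction) and model scores.
y_true = rng.binomial(1, 0.05, size=1000)
scores = y_true * rng.uniform(0.4, 1.0, size=1000) + (1 - y_true) * rng.uniform(0.0, 0.8, size=1000)

selected_fraction = 0.10                      # inspect the top 10% of transactions
k = int(selected_fraction * len(scores))
top_k = np.argsort(scores)[::-1][:k]

recall = y_true[top_k].sum() / y_true.sum()   # fraction of all errors caught in the selection
precision_at_k = y_true[top_k].mean()
lift = precision_at_k / y_true.mean()         # improvement over inspecting at random

print(f"recall at {selected_fraction:.0%} selected: {recall:.2f}")
print(f"lift at {selected_fraction:.0%} selected: {lift:.1f}x")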
Very often, however, the measures that are commonly used to assess DM models do
not represent the business goals well. The chapter by Bruckhaus and Guthrie discusses
how evaluation can be made by customers, not only at the model level but at the individ-
ual decision level [11]. The authors give a few examples of how business goals can be
translated to technical DM goals. As discussed in some chapters in this book, visualiza-
tion tools are also very useful to present DM results to domain experts in a way that is
easy for them to understand (e.g., [6,18,15]).
Furthermore, most of the time, the evaluation of a DM system is not based on a single
but rather on multiple criteria. Criteria of interest usually include technical measures
as well as business-related ones (e.g., the cost of making an incorrect prediction). The
chapter by Puuronen and Pechenizkiy describes a framework that allows researchers to
take these considerations into account [8].
In many situations, the users not only require a model that achieves the goals of the
business in terms of a suitable measure (or measures) but they also need to understand
the knowledge represented by that model. In this case, the data must describe concepts
that are familiar to the users (e.g., [10]) represented in a way that they understand (e.g.,
by discretizing continuous attributes [9,5]). Additionally, the algorithm must generate
models in a language that is also understandable by the users (e.g., decision trees in
the automotive industry [9] and logistic regression in the financial industry [13]). The
interactive algorithms proposed by Blumenstock et al. also contribute to evaluation in an
interesting way [9]. Given that the users interactively participate in the building of the
model, they are, thus, committing to the quality of the result.
Other tools that can be helpful for the evaluation are simulation (e.g., [5]), compari-
son with an existing theory (e.g., [18]) or with the decisions made by humans (e.g., [7])
and the use of satisfaction surveys (e.g., [7]).
1.5. Deployment
Deployment is the step in which the solution developed in the project, after being prop-
erly tested, is integrated into the (decision-making) processes of the organization.
Despite being critical for the success of a DM project, this step is often not given
sufficient importance, in contrast to other steps such as business understanding and data
preparation. This attitude is illustrated quite well in the CRISP-DM guide [4]:
In many cases it is the customer, and not the data analyst, who carries out the deployment
steps. However, even if the analyst will not carry out the deployment effort it is important for
the customer to understand up front what actions need to be carried out in order to actually
make use of the created models.
This graceful handing over of responsibilities of the deployment step by the data analyst
can be the cause of the failure of a DM project which, up to this step, has obtained
promising results. This is confirmed in some of the chapters in this book, which clearly
state that the development of adequate tools is important [9,6].
A very important issue when the tool is to be used by domain experts who are not
data miners is the interface. The system must be easy to use and present the results clearly,
using a language that is familiar to the user [9,5,7,15].
Another important aspect is caused by the changes in conditions when moving from
the development to the operational setting. DM projects are often developed with samples
of data and without concerns for learning and response times. However, when deploying
a DM solution, its scalability and efficiency should be considered carefully [14].
Given that the goal is to incorporate the result of the DM project into the company’s
business processes, it is usually necessary to integrate it with the information system
(IS). This can be done at different levels. On one end, it may be completely independent,
with data being moved from the IS to the system on a periodic basis. At the other end, we
have solutions that are tightly integrated into the IS, possibly requiring the system to be
reengineered. Most of the applications in this volume follow an intermediate approach,
where the DM solution is developed as a separate solution with its own user interface
and integration into the IS being achieved by the sharing of the database [9,5,6].
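A minimal sketch of such a loosely coupled deployment (hypothetical table and column names, not taken from any chapter) is a periodic batch job that reads fresh records from a database shared with the IS, applies a previously trained model, and writes the scores back for the business processes to use:

import sqlite3
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy in-memory database standing in for the database shared with the company's IS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, f1 REAL, f2 REAL)")
conn.execute("CREATE TABLE churn_scores (id INTEGER, score REAL)")
rng = np.random.default_rng(4)
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(i, float(rng.normal()), float(rng.normal())) for i in range(100)])

# A model trained earlier in the project (here fitted on synthetic data for self-containment).
X_hist = rng.normal(size=(500, 2))
y_hist = (X_hist[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)
model = LogisticRegression(max_iter=1000).fit(X_hist, y_hist)

# Periodic batch job: read current data from the shared database, score it,
# and write the results back where the IS and its users can pick them up.
rows = conn.execute("SELECT id, f1, f2 FROM customers").fetchall()
ids = [r[0] for r in rows]
X = np.array([r[1:] for r in rows])
scores = model.predict_proba(X)[:, 1]
conn.executemany("INSERT INTO churn_scores VALUES (?, ?)", list(zip(ids, map(float, scores))))
print(conn.execute("SELECT COUNT(*) FROM churn_scores").fetchone()[0], "customers scored")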
2. Overview
The chapters are organized into three groups. In Part 1, we present chapters that discuss
methodological issues. The chapters in Part 2 describe case studies in some of the com-
mon areas of application of DM. Finally, Part 3 contains chapters that address some in-
novative applications of DM technology. In the following sections we give an overview
of their content.
This part starts with a reflection by Blumenstock et al. on the extensive DM experience
at Daimler [9]. This company has been involved in the research and application of DM
technologies for over 20 years. This chapter uses a case study on warranty data analysis to
discuss some of the lessons learned during that period. They argue for focusing on the needs
of the users and claim that the two most important principles to achieve this are simplicity
and interactivity. In this work, they take this principle to an extreme. Besides integrating
the experts into the business understanding phase, the data understanding and preparation
phases, and the evaluation phase, they also use their domain knowledge explicitly in the
modeling phase: they propose interactive algorithms where the expert is asked to make
decisions during the model building process.
As the number of DM projects increases, together with the number of models de-
veloped for each task, companies are becoming interested in defining best practices to
reduce the effort without reducing the quality of the models. This problem is addressed
in the second chapter, focusing on the financial services sector [10]. The case study dis-
cussed is on the identification of sales leads in a B2B context. Prospective customers
are segmented and different models are generated for different segments. The authors
use a tool that embodies some of the best practices they developed. These best practices
support several of the phases of the DM process, such as data preparation, algorithm
selection, parameter tuning and evaluation.
Chapter 3 also addresses the problem of taking the requirements of users into ac-
count, but the focus is not on DM projects [8]. The authors go further back, to DM re-
search. They observe that in research, data are a resource for which benefits have been
widely publicized by the DM community but whose costs have been mostly ignored.
They argue that this problem can be addressed by taking a cost/benefit perspective in the
evaluation of DM, and propose a multicriteria, utility-based framework for that purpose.
This framework is flexible enough to be useful for users with different roles. Some of
the discussion in this chapter is based on an interesting parallel with the evolution of the
Information Systems (IS) field.
The last chapter in this part is an interesting contribution to the discussion on the
need to align general DM evaluation criteria with domain-specific criteria [2], by Bruck-
haus and Guthrie [11]. The authors argue that domain experts should be involved in the
evaluation of both models and individual predictions. The importance of using the lan-
guage of the domain both in the representation of the models and in the communication
between data miners and the domain experts is stressed. The ideas are discussed in the
context of a case study in the semiconductor industry.
The second part starts with a case study in one of the most popular application areas of
DM, Customer Relationship Management (CRM), by Mutanen et al. [13]. In this particular
case, the paper addresses a churn prediction problem in a Finnish retail bank. The authors
use an algorithm which is often used in this type of problem, logistic regression. A
lot of attention is dedicated both to data preparation and model analysis and evaluation,
which is essential for a successful DM project in this domain. One particularly important
issue in churn prediction is the suitable definition of the churn variable. This variable is
frequently defined in such a way that, when churn is predicted, the customer is already
lost.
The case study described by Torgo and Soares in Chapter 7 is a good example of an
application in which the constraints of the application are deeply integrated in the DM
project [12]. The problem tackled is the detection of errors in data that are used by the
Portuguese institute of statistics to calculate foreign trade statistics. The constraints that
must be verified affect not only data preparation and evaluation but they also affect the
algorithm. The authors propose an error detection method based on hierarchical clus-
tering that is adapted to take into account that the resources available for inspection are
limited and change throughout time.
The application presented in Chapter 8 by Rodrigues et al. is on the energy industry,
an area with a growing number of opportunities for DM [5]. The chapter addresses the
problem of recommending the most appropriate tariff for energy consumers based on
their consumption profile. This chapter illustrates the integration of different DM meth-
ods with other techniques to build a decision support system (DSS) that will be used by
domain experts who are not knowledgeable about DM.
The application described in Chapter 9 by Ceglar et al. [6] is in healthcare, another
domain of major importance for DM. The authors describe a DM tool designed specifically to
improve the quality of care and resource management in hospitals. Although it is specific
for this domain and targeted at users who are not professional data miners, it gives the
users some freedom to explore the implemented algorithms. Being oriented towards non-
data mining experts, it emphasizes simple communication with the users and has a strong
focus on visualization and model description techniques. This is also a good example of
an application that motivates new research. The tool implements data mining methods
that were developed to address some of the specificities of the domain. It is also a good
illustration of the collaboration between universities and companies and the authors also
discuss some of the lessons learned.
Chapter 10, by Breitenbach et al., addresses an emerging application area for DM, criminology [18].
This is essentially a descriptive DM application with the purpose of identifying criminal profiles.
The specificities of the application, namely the need to obtain a
reliable description of profiles, lead to the development of a new clustering method. The
results also illustrate a very important contribution that DM techniques can make. Some
of the profiles identified by the method from the data are in contradiction with existing
theories in criminology research. This led to the need to further investigate their validity
and opened a potentially novel perspective on criminal profiles.
The second application in this part, by Jank and Shmueli, is in another promising
domain, social networks [14]. The problem tackled is the prediction of prices in online
auctions on eBay. The goal is to help users focus on the potentially most advantageous
auctions among the large number that may involve goods of interest to them.
This is a very challenging problem due to the complex and highly dynamic nature of
the data. A new method is proposed in which techniques from very diverse fields are
combined, including functional data analysis and econometrics, to make real-time fore-
casts. Besides a complex methodology, the solution incorporates a significant amount
of domain knowledge. The authors point out that to build a practical system incorporat-
ing this approach, two very important issues must be addressed, namely scalability and
efficiency.
The system described by Yeh and Kass in Chapter 12 addresses an essential problem
that companies face in today’s extremely competitive environment [7]: how to filter and
use external information that is relevant to their business. There are too many potential
sources of relevant information for a company to monitor all of them. The authors de-
scribe a tool to continuously monitor the web in search of important information for a
specific goal. The tool presents the information to users using dashboards for simplicity.
This is another example of the integration of different technologies, including text and
web mining, natural language processing and inference engines. These technologies are
combined with domain knowledge to build complex models of the context of a company.
Two prototype applications are described. One of them detects information that is used
to assess the maturity of emerging technologies. This can be used
by middle management to support decisions concerning which technology to invest in.
The second extracts business insights about the market, such as threats and opportunities,
from the web (e.g., the use of new materials by competitors that enables a reduction in
production costs), which can be used for strategic decision making. The usability of the
tool is essential because it is used by managers who have little or no knowledge of data mining.
The final chapter, by Körner et al., addresses spatial data mining [15]. The applications
described combine spatial data (e.g., obtained from surveys) with time and socio-demographic
data. This raises challenging issues that turn this into one of the more interesting research
areas in data mining.
3. Conclusions
Acknowledgments
We would like to start by acknowledging Pavel Brazdil, who suggested to Carlos S. the
organization of the first workshop, held together with ECML/PKDD.
We are also indebted to the colleagues who helped us organize the workshops which
have been the basis for this book: Luís Moniz, Catarina Duarte, Markus Ackermann,
Bettina Guidemann, Françoise Soulié-Fogelman, Katharina Probst and Patrick Gallinari.
We are also thankful to the members of the Program Committee for their timely and
thorough reviews, despite receiving more papers than promised, and for their comments,
which we believe were very useful to the authors.
We also wish to thank the organizations of the conferences who have hosted the
workshops: ECML/PKDD 2005,8 KDD 2006,9 ECML/PKDD 200610 and KDD 2008.11
We are very thankful to everybody who helped us to publicize the workshop, par-
ticularly Gregory Piatetsky-Shapiro (www.kdnuggets.com), Guo-Zheng Li (MLChina
Mailing List in China) and KMining (www.kmining.com). We are also thankful to Rita
Pacheco, from INESC Porto LA, for revising some of the chapters.
The support of several institutions is also gratefully acknowledged:
SAS,12 SPSS,13 KXEN,14 Accenture15 and LIAAD-INESC Porto LA.16
The first author gratefully acknowledges the financial support of the Faculdade de Econo-
mia do Porto and of the following projects: Triana (POCTI/TRA/61001/2004/Triana),
Site-O-Matic (POSI/EIA-58367-2004), Oranki (PTDC/EIA/68322/2006) and Rank!
(PTDC/EIA/81178/2006), funded by the Fundação para a Ciência e a Tecnologia (FCT) and
co-financed by FEDER.
8 http://www.liaad.up.pt/~ecmlpkdd05/
9 http://www.sigkdd.org/kdd2006/
10 http://www.ecmlpkdd2006.org/
11 http://www.sigkdd.org/kdd2008/
12 http://www.sas.com/
13 http://www.spss.com/
14 http://www.kxen.com/
15 http://www.accenture.com/
16 http://www.liaad.up.pt/
References
[1] Robert L. Grossman, Chandrika Kamath, Philip Kegelmeyer, Vipin Kumar, and Raju R. Namburu. Data
Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, Norwell, MA, USA,
2001.
[2] R. Kohavi and F. Provost. Applications of data mining to electronic commerce. Data Mining and
Knowledge Discovery, 6:5–10, 2001.
[3] Carlos Soares, Yonghong Peng, Jun Meng, Takashi Washio, and Zhi-Hua Zhou, editors. Applications
of Data Mining in E-Business and Finance, volume 177. IOS Press, Amsterdam,
The Netherlands, 2008.
[4] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth. CRISP-DM 1.0:
Step-by-Step Data Mining Guide. SPSS, 2000.
[5] Fátima Rodrigues, Vera Figueiredo, and Zita Vale. An integrated system to support electricity tariff
contract definition. In Carlos Soares and Rayid Ghani, editors, Data Mining for Business Applications,
Frontiers in Artificial Intelligence and Applications, chapter 8. IOS Press, 2010.
[6] Aaron Ceglar, Richard Morrall, and John F. Roddick. Mining medical administrative data – the PKB
suite. In Carlos Soares and Rayid Ghani, editors, Data Mining for Business Applications, Frontiers in
Artificial Intelligence and Applications, chapter 9. IOS Press, 2010.
[7] Peter Z. Yeh and Alex Kass. A technology platform to enable the building of corporate radar applications
that mine the web for business insight. In Carlos Soares and Rayid Ghani, editors, Data Mining for
Business Applications, Frontiers in Artificial Intelligence and Applications, chapter 12. IOS Press, 2010.
[8] Seppo Puuronen and Mykola Pechenizkiy. Towards the generic framework for utility considerations in
data mining research. In Carlos Soares and Rayid Ghani, editors, Data Mining for Business Applications,
Frontiers in Artificial Intelligence and Applications, chapter 4. IOS Press, 2010.
[9] Axel Blumenstock, Markus Mueller, Carsten Lanquillon, Steffen Kempe, Jochen Hipp, and Ruediger
Wirth. Interactivity closes the gap. In Carlos Soares and Rayid Ghani, editors, Data Mining for Business
Applications, Frontiers in Artificial Intelligence and Applications, chapter 2. IOS Press, 2010.
[10] Raul Domingos and Thierry van de Merckt. Best practices for predictive analytics in B2B financial
services. In Carlos Soares and Rayid Ghani, editors, Data Mining for Business Applications, Frontiers
in Artificial Intelligence and Applications, chapter 3. IOS Press, 2010.
[11] Tilmann Bruckhaus and William Guthrie. Customer validation of commercial predictive models. In
Carlos Soares and Rayid Ghani, editors, Data Mining for Business Applications, Frontiers in Artificial
Intelligence and Applications, chapter 5. IOS Press, 2010.
[12] Luís Torgo and Carlos Soares. Resource-bounded outlier detection using clustering methods. In Carlos
Soares and Rayid Ghani, editors, Data Mining for Business Applications, Frontiers in Artificial Intelli-
gence and Applications, chapter 7. IOS Press, 2010.
[13] Teemu Mutanen, Sami Nousiainen, and Jussi Ahola. Customer churn prediction - a case study in retail
banking. In Carlos Soares and Rayid Ghani, editors, Data Mining for Business Applications, Frontiers
in Artificial Intelligence and Applications, chapter 6. IOS Press, 2010.
[14] Wolfgang Jank and Galit Shmueli. Forecasting online auctions using dynamic models. In Carlos Soares
and Rayid Ghani, editors, Data Mining for Business Applications, Frontiers in Artificial Intelligence and
Applications, chapter 11. IOS Press, 2010.
[15] Christine Körner, Dirk Hecker, Maike Krause-Traudes, Michael May, Simon Scheider, Daniel Schulz,
Hendrik Stange, and Stefan Wrobel. Spatial data mining in practice: Principles and case studies. In
Carlos Soares and Rayid Ghani, editors, Data Mining for Business Applications, Frontiers in Artificial
Intelligence and Applications, chapter 13. IOS Press, 2010.
[16] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
[17] Françoise Soulié-Fogelman. Data mining in the real world: What do we need and what do we have? In
R. Ghani and C. Soares, editors, Proceedings of the Workshop on Data Mining for Business Applications,
pages 44–48, 2006.
[18] Markus Breitenbach, Tim Brennan, William Dieterich, and Greg Grudic. Clustering of adolescent crim-
inal offenders using psychological and criminological profiles. In Carlos Soares and Rayid Ghani, edi-
tors, Data Mining for Business Applications, Frontiers in Artificial Intelligence and Applications, chap-
ter 10. IOS Press, 2010.
Part 1
Data Mining Methodology
Data Mining for Business Applications 17
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-17
Interactivity Closes the Gap
Axel Blumenstock, Markus Mueller, Carsten Lanquillon, Steffen Kempe, Jochen Hipp and Ruediger Wirth
Abstract. After nearly two decades of data mining research there are many com-
mercial mining tools available, and a wide range of algorithms can be found in
literature. One might think there is a solution to most of the problems practition-
ers face. In our application of descriptive induction on warranty data, however, we
found a considerable gap between many standard solutions and our practical needs.
Confronted with challenging data and requirements such as understandability and
support of existing work flows, we tried many things that did not work, ending up
in simple solutions that do. We feel that the problems we faced are not so uncom-
mon, and would like to advocate that it is better to focus on simplicity—allowing
domain experts to bring in their knowledge—rather than on complex algorithms.
Interactivity and simplicity turn out to be key features to success.
1. Introduction
An air bellows bursts: this happens on one truck but not on another. Is this random coincidence, or the result of some systematic weakness? Questions like these have always kept the experts at Daimler's After Sales Services busy. Detecting and fixing unexpected quality issues as early as possible is key to the continuous improvement of Daimler's top-quality products and to ensuring customer satisfaction.
This primary goal of quality enhancement entails several tasks to be solved:
• predicting upcoming quality issues as early as possible,
• explaining why some kind of quality issue occurs and feeding this information
back to engineering,
• isolating groups of vehicles that might be affected by a certain defect in future, so
as to make service actions more targeted and effective.
When working on these tasks, quality engineers get valuable insights from analyzing
warranty and production data. Systems for early warning, quality reporting, warranty
cost control, and root cause investigations build upon a quality warehouse which integrates these data sources.
1 Contact author: [email protected]
Our users are experts in the field of vehicle engineering, specialized in various subdomains such as engine or electrical equipment. They keep track of what is going on in the field, mainly by analyzing warranty data, and try to discover upcoming quality issues as early as possible. As soon as they recognize a problem, they strive to find the root cause in order to address it as accurately as possible.
They have been doing these investigations successfully for years. Now, data mining can help them to better meet the demands of fast reaction, well-founded insight and
targeted service. But any analysis support must fit into the users’ mindset, their language,
and their work flow.
The structure of the problems to be analyzed varies substantially. This task requires
inspection, exploration and understanding for every case anew. Ideally, the engineers
should be enabled to apply various exploration and analysis methods from a rich repos-
itory. And it is important that they do it themselves, because no one else could decide
quickly enough whether a certain clue is relevant and should be pursued, nor ask the
proper questions. Explaining strange phenomena requires both comprehensive and de-
tailed background knowledge.
Yet, the engineers are not data mining experts. They could make use of data mining tools out of the box, but common data mining suites already require a deeper understanding of the methods. Further, the users are reluctant to accept any system-generated hypothesis if the system cannot give exact details that justify this hypothesis. The bottom line is that comprehensibility and, again, interactivity are almost indispensable features of any
mining system in our field.
Most of the data at hand is warranty data, providing information about diagnostics and
repairs at the dealerships. Further data is about vehicle production, configuration and
usage. All these sources are heterogeneous, and the data is not collected for the purpose
of causal analyses. This raises questions about reliability, appropriateness of scale, and
level of detail. Apart from these concerns, our data has some additional properties that
make it hard to analyze, including
Imbalanced classes: The class of interest, made up of all instances for which a certain
problem was reported, is very small compared to its contrast set. Often, the pro-
portion is far below 1 %.
Multiple causes: Sometimes, a single kind of problem report can be traced back to dif-
ferent causes that produced the same phenomenon. Therefore, on the entire data
set, even truly explanatory rules show only modest qualities in terms of statistical
measures.
Semi-labeledness: The counterpart of the positives is not truly negative. If there is a warranty entry for some vehicle, it is (almost) certain that it indeed suffered the problem reported on. For any non-positive example, however, it is unclear whether it carries problematic properties and may fail in the near future.
High-dimensional space of influence variables: There are thousands of variables, each being potentially relevant only for a specific subset of quality issues. Although the feature space can be reduced tremendously by automatic and interactive feature selection, many variables have to be considered artifacts that are irrelevant for a specific analysis.
Influence variables interact strongly: Some quality issues do not occur until several
influences coincide. And, if an influence exists in the data, many other non-causal
variables follow, showing positive statistical dependence with the class as well.
True cause not in data: It is very unlikely that the actual cause of a quality issue is among the configuration or production related variables that make up the data. Often, the engineer has to deduce abstract concepts like usage scenarios from the vehicle data available.
These properties make it all but impossible for a truly causal influence to be found by a fully automated process. Yet, these problems can be addressed by allowing users to bring in
their background knowledge. As long as this knowledge is very much case-specific and
thus difficult to formalize, there seems to be hardly any alternative to a setup of interactive
model-building.
Let us first have a theoretical look at the problem. We consider as non-conforming those vehicles that are more likely to be affected by a specific quality issue than others. For any vehicle
we would like to be able to tell whether it is likely to encounter problems in the future.
However, the model should not only be predictive, but primarily help the engineer in
understanding and explaining the quality issue. The notion of causality plays a key role
in this process. Our goal is to come as close to the root cause of a quality issue as possible
by identifying the characteristics of non-conforming vehicles (Figure 1). Hence, data
mining should not only help to reveal potentially useful influences the engineer would
have never thought of, but should also help to narrow down the root cause by eliminating
non-causal findings and thus rejecting hypotheses.
Figure 1. The data mining task is to separate non-conforming vehicles that are likely to be affected by a specific quality issue (×) from all other vehicles by identifying the main characteristics of the non-conforming ones. In the example, the fraction of non-conforming vehicles among the Classic edition is higher than among Avantgarde vehicles. (The two groups shown are labeled Edition=Avantgarde and Edition=Classic.)
Table 1. Example data set: 2 700 of 60 000 vehicles are non-conforming vehicles with a DTC set.

  Cruise Control   Edition       Vehicles with DTC   No. Vehicles
  Yes              Avantgarde                  200         10 000
  No               Avantgarde                  400         20 000
  Yes              Classic                   1 400         20 000
  No               Classic                     700         10 000
  Total                                      2 700         60 000
Table 2. The sales code Cruise Control seems to be related to the failure when looking at the whole data set (Table 2(a)). If one analyzes the subsets of Avantgarde and Classic vehicles separately, Cruise Control no longer has an effect on the failure (Table 2(b)).
To motivate the importance of causality, let us have a look at the following illustrative example (Table 1). Assume that 2 700 of 60 000 vehicles are non-conforming vehicles that are brought to dealerships because a lamp indicates a diagnostic trouble code (DTC). Now the engineer compares the fault rates for vehicles equipped with cruise control to those without (Table 2(a)). A binomial test would indicate a significant deviation of the target share in the subgroup of vehicles with cruise control. However, if the engineer calculates fault rates for Classic and Avantgarde vehicles separately, cruise control no longer seems to be related to the issue (Table 2(b)). The variable cruise control is conditionally independent of the class variable given the state of the variable edition. The primary influence is edition, and cruise control only seems to be important as more Clas-
sic edition vehicles are equipped with cruise control than Avantgarde vehicles. What is
worse is that the true cause is probably hidden behind the influence edition. Hence, our
goal is to come as close to the true cause as possible by suppressing findings that are
likely to be non-causal.
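To make the example concrete, the following short Python sketch (ours, not part of the analysis environment described here) recomputes the fault rates from Table 1 and illustrates the conditional independence just discussed; the binomial-test line uses scipy and can be omitted.

```python
from scipy.stats import binomtest   # optional, only for the significance test

# Counts from Table 1: (Edition, CruiseControl) -> (vehicles with DTC, vehicles)
counts = {("Avantgarde", "yes"): (200, 10_000), ("Avantgarde", "no"): (400, 20_000),
          ("Classic", "yes"): (1_400, 20_000), ("Classic", "no"): (700, 10_000)}

def fault_rate(rows):
    return sum(f for f, _ in rows) / sum(n for _, n in rows)

overall = fault_rate(counts.values())                          # 0.045

# Marginal view: cruise control appears to matter ...
for cc in ("yes", "no"):
    rows = [v for (ed, c), v in counts.items() if c == cc]
    print("CruiseControl =", cc, round(fault_rate(rows), 3))   # 0.053 vs 0.037
print(binomtest(1600, 30_000, overall).pvalue)                 # "significant" deviation

# ... but within each edition it does not (conditional independence given Edition).
for (ed, cc), v in sorted(counts.items()):
    print(ed, cc, fault_rate([v]))                             # 0.02, 0.02, 0.07, 0.07
```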
The rationale is that in a domain with thousands of strongly interrelated influence variables
there may be many descriptions that are statistically similar, but only some of them give
the crucial hint to the problem cause. How should a system distinguish them? Subgroup
description is thus required to provide any reasonable explanation as long as there is no
evidence that the finding is void or unjustified.
In short, rather than finding the best-fitting model or composing some statistically
optimized representation, the new challenge is to provide guidance to a user who inter-
actively explores a huge space of hypotheses. In the course of such analyses, he continually formulates his own hypotheses and wants the system to tell him what supports or contradicts his assumptions, and what is worth considering beyond them.
Subgroup discovery (and description) can be mapped to partitioning the instance set into multiple decision tree leaves. Paths to leaves with a high share of the positive class then correspond to interesting subgroup descriptions. We
require that the measure be able to identify interesting subgroups, i.e. single nodes of the
decision tree, and that it be comprehensible to the business users.
Most of the time, we deal with two-class problems: the positive class C = 1 versus
the contrasting rest C = 0 with the positive class attracting our attention as the interesting
concept. Hence, we may use the measure lift (the factor by which the positive class rate
P (C = 1 | A = a) in a given node A = a is higher than the positive class rate in the
root node P (C = 1 | ∅)):
lift(A = a → C = 1) = P(C = 1 | A = a) / P(C = 1)
To complement the lift value of a tree node, we use the recall (the fraction covered) of
the positive class:
recall(A = a → C = 1) = P (A = a | C = 1)
Both lift and recall are readily understandable for the business users as they have imme-
diate analogies in their domain. Furthermore, note that as we are considering only the lift
and recall values of the most interesting node rather than an average value over all nodes resulting from a split, as is done in general tree induction for classification tasks, we are now able to focus on interesting subgroups.
Figure 2. Quality space for the assessment of split attributes. Each dot represents an attribute, plotted over recall (x axis) and lift (y axis) of the best (possibly clustered) child that would result. Dots are plotted bold if there is no other dot that is better in both dimensions. The curves are isometrics according to the recall-weighted lift (wtLift).
By focusing on high-lift paths, the users can successively split tree nodes to reach a
lift as high as possible while maintaining nodes with substantial recall. Note that simply
choosing nodes with maximal lift does not suffice as this often results in nodes with only
very few instances which are obviously not helpful in explaining root causes despite their
high lift. Therefore, we always have to consider both lift and recall in combination.
In order to condense lift and recall values into a suitable attribute ranking, we have
derived a one-dimensional measure which we refer to as weighted lift or “explanational
power”:
wtLift(A = a → C = 1) = P(A = a | C = 1) · (1 − 1 / lift(A = a → C = 1))
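As a quick numerical check (our own sketch, with helper names we invented), the following Python snippet evaluates lift, recall and the weighted lift for the node Edition=Classic from Table 1; the result agrees with the corresponding row of Table 3.

```python
def lift(pos_in_node, n_node, pos_total, n_total):
    # factor by which the node's positive class rate exceeds the overall rate
    return (pos_in_node / n_node) / (pos_total / n_total)

def recall(pos_in_node, pos_total):
    # fraction of all positive instances covered by the node
    return pos_in_node / pos_total

def wt_lift(pos_in_node, n_node, pos_total, n_total):
    # recall-weighted lift ("explanational power")
    return recall(pos_in_node, pos_total) * (1 - 1 / lift(pos_in_node, n_node, pos_total, n_total))

# Node Edition=Classic from Table 1: 2 100 faulty vehicles among 30 000,
# against 2 700 faulty among 60 000 overall.
print(round(lift(2100, 30000, 2700, 60000), 2))     # 1.56
print(round(recall(2100, 2700), 2))                 # 0.78
print(round(wt_lift(2100, 30000, 2700, 60000), 2))  # 0.28, the value in Table 3
```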
This weighted lift is not intended to add to the tens of statistical quality measures that
already exist. Many other measures will do. The point is that while we need a scalar
measure to provide the attribute ranking, we always present the lift and recall values for
each attribute to the users, too, as these are the measures they understand.
As an alternative to a ranked list, the user can still get the more natural two-
dimensional presentation of the split attributes (Figure 2). Similar to plotting a ROC
space, every such attribute is drawn as a point while using recall and lift as the two
dimensions.
Using the weighted lift to rank attributes works well for symbolic attributes with
a small number of possible values. For attributes with large domains and for numeric
attributes we require some further processing to suit our interactive setting.
Let us consider symbolic attributes with large domains. There are two pitfalls in our
application domain. First, our measure is more sensitive with regard to skewed distributions of attribute values, as we consider only single nodes rather than averages. Second, as we focus on interesting paths within the tree, any further nodes resulting from a split only distract attention and impair understandability.
To mitigate these problems, we group attribute values (or, the current node’s chil-
dren). We require the resulting split to create at most k children, where typically k = 2 so
as to force binary splits. This ensures both that the split is “handy” and easily understood
by the user, and that the subsequent attribute ranking can be based consistently on the
child node with the highest lift.
C4.5 [5], for example, offers some grouping facility that merges pairs of children as
long as the gain ratio does not degrade. However, this algorithm of quadratic complexity is neither necessary nor reasonable in our two-class world with a focus on high-lift paths, the more so as low response times are desirable in interactive use.
To group the children in a reasonable and efficient way without any structural infor-
mation on the attribute domain, we automatically proceed as follows. Initially we sim-
ply sort the nodes resulting from a split by their lift values. Then, keeping their linear
order, we cluster them using several heuristics: first, to increase the robustness of the approach, we merge the smallest nodes with their nearest neighbors with regard to lift values. Then we continue in an agglomerative clustering manner: we merge the adjacent nodes with the lowest lift difference until the desired number of nodes is reached.
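The following Python sketch illustrates the core of this grouping heuristic under simplifying assumptions of ours: it uses a plain (unweighted) mean lift per group and omits the initial robustness step of absorbing very small nodes.

```python
def group_children(children, k=2):
    """children is a list of (attribute value, lift) pairs.  Sort by lift, then
    repeatedly merge the adjacent pair of groups with the smallest difference in
    mean lift until at most k groups remain."""
    groups = [[c] for c in sorted(children, key=lambda c: c[1])]

    def mean_lift(group):
        return sum(lift for _, lift in group) / len(group)

    while len(groups) > k:
        diffs = [mean_lift(groups[i + 1]) - mean_lift(groups[i])
                 for i in range(len(groups) - 1)]
        i = diffs.index(min(diffs))              # adjacent pair closest in lift
        groups[i:i + 2] = [groups[i] + groups[i + 1]]
    return groups

# e.g. group_children([("DE", 1.1), ("FR", 0.9), ("US", 2.3), ("IT", 1.0)], k=2)
# -> [[("FR", 0.9), ("IT", 1.0), ("DE", 1.1)], [("US", 2.3)]]
```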
Although the grouping of attribute values is performed automatically during attribute
assessment, the users may undo and redo the grouping interactively. They may even
arrange the attribute values into any form that they desire. This is important to further
incorporate background knowledge, e.g. with respect to ordered domains, geographical
regions, or, in particular, components that are used in certain subsets of vehicles and
should, thus, be considered together.
Figure 3. Example of a numeric split chart for the attribute BuildDate. The height of the bars indicates the
subgroup size, i.e. the number of vehicles produced in a specific month. The color of the bars encodes the lift
on a scale from green to red (here: grayscale).
Now, consider numeric attributes. In order to be used in a decision tree, the numeric
domain has to be discretized. For our interactive setting it is most adequate, for reasons of understandability, to provide either binary or ternary splits. All we have to do is adapt the method of dealing with numeric attributes deployed by any one of the common tree induction algorithms so that split point quality is assessed based on our measures with respect to the single most interesting node, instead of the standard measures averaging over all nodes of a split.
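A rough Python sketch of such an adapted split-point search, under simplifying assumptions of ours (0/1 class labels, a user-supplied list of candidate thresholds), could look as follows.

```python
def wt_lift(pos, n, pos_total, n_total):
    lift = (pos / n) / (pos_total / n_total)
    return (pos / pos_total) * (1 - 1 / lift)

def best_binary_split(values, labels, candidates):
    """Pick the threshold whose more interesting child node has the highest
    weighted lift, instead of averaging over both children as standard
    impurity-based tree induction does.  labels are 0/1 class indicators."""
    pos_total, n_total = sum(labels), len(labels)
    best = None
    for t in candidates:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        for child in (left, right):
            if child and sum(child) > 0:
                score = wt_lift(sum(child), len(child), pos_total, n_total)
                if best is None or score > best[0]:
                    best = (score, t)
    return best   # (weighted lift of the most interesting child, split point)
```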
To enhance interactivity, the users may modify a resulting split by interactively
choosing their own split points. Again, this allows further incorporation of background
knowledge, which proved to be especially useful for date attributes, e.g. with respect to
known changes in product development or known clean points in the production pro-
cess. Supporting split point selection by means of interactive diagrams which visualize
the relevant lift and recall values depending on the numeric attribute under consideration
(Figure 3) is of great value for the users. In addition, it is very helpful to immediately
show a preview of the resulting sub-tree while adjusting the split points.
Two split previews for the example data set in Table 1 are depicted in Figure 4.
Weighted lift ranks the variable Edition higher than the variable CruiseControl.
Figure 4. Two possible splits for the example data set in Table 1. In the split on Edition (Figure 4(a)), the child nodes have fault rates of 2% (600 of 30 000 vehicles) and 7% (2 100 of 30 000); in the split on CruiseControl (Figure 4(b)), the rates are 3.7% (1 100 of 30 000) and 5.3% (1 600 of 30 000). The decision tree in Figure 4(a) is superior to the split in Figure 4(b) with regard to lift and recall. The color of the node header encodes the lift.
3.1.3. Causality
Interactivity plays the key role in our approach and it is important that a model does
not maximize a statistically motivated scoring function, but the expert’s degree of belief
in the correctness of a hypothesis. Hence, an engineer could be tempted to pick cruise
control instead of edition for some reason in the example above. By looking ahead one
level as done in Figure 5, one can detect non-causality. Interactive look-ahead decision
trees based on the application of Bayesian partition models as described in [6] consider
causality in the attribute ranking. Moreover, taxonomies and partonomic structures are
exploited to make the model more accurate. Normalized mutual information is used to
measure and visualize attribute similarity during split attribute selection.
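As an illustration only (the exact normalization used in [6] is not stated here), a normalized mutual information between two discrete attributes can be computed along the following lines; in this sketch we normalize by the smaller of the two entropies.

```python
import math
from collections import Counter

def normalized_mutual_information(xs, ys):
    """Mutual information of two discrete variables, normalized by the smaller
    of the two entropies (one of several common normalizations)."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))

    def entropy(counts):
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    mi = sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
             for (x, y), c in pxy.items())
    h_min = min(entropy(px), entropy(py))
    return mi / h_min if h_min > 0 else 0.0

# Two attributes that determine each other get a value close to 1:
print(normalized_mutual_information(["a", "a", "b", "b"], ["x", "x", "y", "y"]))  # 1.0
```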
We mentioned the observation that in real life, influences sometimes interact in a way
that a quality issue does not occur until several influences coincide. While decision tree
building is intuitive, its search is greedy and thus may miss such interesting combina-
tions. So the experts asked for an automatic, more comprehensive search. While data is an inherently fragmentary picture of reality, and no complete enumeration of patterns is guaranteed to find relevant influences, the user can be assured that he gets the patterns most likely to be useful, at least within the specified data range and search depth.
A second determinative observation is that the quality issue that defines the class
variable often traces back to several independent sub-phenomena, and that a lot of quasi-
noise exists. A typical model generation regime that tries to fit the entire data in the best possible way will easily be misled on such data. Building only partial models seems a
way out.
Figure 5. The variable CruiseControl is conditionally independent of the target variable given the variable
Edition.
Table 3. Rule set for the example data in Table 1. The consequence is fixed and thus omitted. Similar rules, i.e. rules that cover similar instances, or potentially non-causal rules can be highlighted in our application.
Subgroup Coverage Recall Fault Rate wtLift
Edition=Classic 50% 78% 7% 0.28
Edition=Avantgarde 50% 22% 2% −0.28
CruiseControl=yes 50% 59% 5.3% 0.09
CruiseControl=no 50% 41% 3.7% −0.10
Edition=Classic ∧ CruiseControl=yes 33% 52% 7% 0.19
Edition=Classic ∧ CruiseControl=no 17% 26% 7% 0.09
Edition=Avantgarde ∧ CruiseControl=yes 17% 7% 2% −0.09
Edition=Avantgarde ∧ CruiseControl=no 33% 15% 2% −0.19
The next logical step was making search exhaustive (within constraints). It is real-
ized by an association rule miner with fixed consequence. For the data set in Table 1, Ta-
ble 3 shows such an exhaustive rule set. Like us, many have pursued this idea, many have come across the problem of the sheer mass of (even significant) rules, and many research groups have thus investigated how to handle redundancy within the results (e.g., [10,11]). These algorithms suppress patterns that are syntactically or statistically similar to others that remain. That way, however, they re-introduce the problem we wanted exhaustiveness to solve: meaningless patterns may accidentally suppress those that are truly causal or at least provide the crucial hint.
To solve the goal conflict of desirable exhaustiveness versus a prohibitive mass of patterns, interactivity once again was a significant step forward. We started off with presenting a ranked list of association rules to the expert and then enabling him to control a CN2-SD-like sequential covering regime. Ranking allows him to find among the first dozens of rules at least one which he recognizes as “interesting” or “already known”. He picks it, modifies the instance set so as to remove this influence, and re-iterates to find the next interesting rule. In contrast to the automatic regime, it is he who decides what is removed, and the danger that he misses something is much smaller.
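A minimal sketch of one such interactive covering step is given below, in Python and with interfaces that are ours, not the original system's: data rows are dicts of attribute values plus a 0/1 field "faulty", and rules are re-scored by their weighted lift on the reduced data.

```python
class Rule:
    """Minimal stand-in for an association rule with fixed consequence."""
    def __init__(self, conditions):              # e.g. {"Edition": "Classic"}
        self.conditions = conditions
    def covers(self, row):
        return all(row.get(a) == v for a, v in self.conditions.items())
    def score(self, data):                        # weighted lift on the given data
        pos_total = sum(r["faulty"] for r in data)
        covered = [r for r in data if self.covers(r)]
        pos = sum(r["faulty"] for r in covered)
        if not data or not covered or pos == 0 or pos_total == 0:
            return float("-inf")
        lift = (pos / len(covered)) / (pos_total / len(data))
        return (pos / pos_total) * (1 - 1 / lift)

def interactive_covering_step(rules, data, picked):
    """One step of the interactive covering regime: remove the instances covered
    by the rule the expert picked, then re-rank the remaining rules."""
    remaining = [row for row in data if not picked.covers(row)]
    return sorted(rules, key=lambda r: r.score(remaining), reverse=True), remaining
```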
While this basic idea of ranking patterns like search engine results and re-ranking
them based on interactively chosen rules was simple and appealing, we learned that in-
teractivity alone is not the magic word: We also had to find modes of visualization and
feedback that ease the flow of information needed to make interactivity effective.
Though often considered understandable, association rules show deficiencies to this end. Already for two variables, as in Table 3, you can hardly tell what impact the individual items (selectors) have within a pattern, or what it would be like if you exchanged one selector for another. In other words, a singular rule provides knowledge that is too fragmented to be reasonably understood and then selected for the interactive exploration scheme described above.
We therefore turned to rule cubes [12]. Rule cubes are named after the cubic structure that results from arraying those as-
sociation rules that belong to the same set of variables. They can equivalently be con-
sidered contingency tables. A pattern now is a set of one or more variables, stating that
these variables have some impact on the class variable.
The most obvious advantage over association rules is that rule cubes allow for quite
intuitive visualizations. Figure 6 shows an example, with lots of variants being possible.
That it is intuitive may be ascribed to the fact that it does not display abstract patterns but how the data is distributed under the pattern, which is a very common mode of thinking in statistics. Presenting entire attribute domains instead of singular values, these visual-
izations answer the aforementioned questions about neighboring rules and the roles of
the individual variables.
The idea of rule cubes turned out to be quite beneficial for the other aspects of our
interactive setup as well, namely ranking and feedback. Having the complete distribution
available, it is easier to implement a fairer ranking of joint influences by only consider-
ing their additional value over the superposition of their components. (This technique is
known from ANOVA or log-linear analysis.) Feeding back information that some influence is known and should be removed can now be implemented in a better way than by simply removing the covered instances.
Fault rates shown in the rule cube of Figure 6:

                         CruiseControl = yes   CruiseControl = no   (all)
  Edition = Classic               7%                   7%             7%
  Edition = Avantgarde            2%                   2%             2%
  (all)                          5.3%                 3.7%
Figure 6. Rule cubes provide an intuitive visualization for a two-dimensional contingency table. The size of
the tiles shows the number of covered instances while the color encodes the fault rate (lift) on a color scale.
Calculating the suppression strength reveals that CruiseControl is pushed by variable Edition.
Figure 6 shows the rule cube for the example data set from Table 1. The fact that
CruiseControl is irrelevant in the light of knowing the variable Edition (i.e., the first is
“suppressed by” the latter) becomes immediately apparent when looking at the conjoint
distribution.
We had a look at several commercially available data mining suites and tools. However,
none of these met the requirements outlined in Section 2.1.
As an overall observation, they were rather inaccessible and often did not allow for
interaction at the model building level. Even if they did, they could not present informa-
tion (like measures) in the non-statistician users’ language. Tools of this kind offer their
methods in a very generic fashion so that the typical domain expert does not know where
to start. In short, we believe that the goal conflict between flexibility and guidance can
hardly be solved by any general-purpose application, where the greatest simplification potential, namely domain adaptation, remains unexploited.
(Figure 7 depicts the process boxes Prepare Data, Explore, Build Tree Model, and Build Cube Model.)
Figure 7. Coarse usage model of our tool. There is a fixed process skeleton corresponding to the original
workflow. The user can just go through, or gain more flexibility (and complexity) upon request.
Our tool does not confine the user to a single process but allows going deeper and gaining flexibility wherever the user is able and willing to.
Usually, the users start with extracting data for further analysis. We tried to keep this
step simple and hide the complexities as much as possible. The user just selects the vehicle subset and the influence variables he would like to work with. A metadata-based system takes care of joins, aggregations, discretizations and other data transformation steps. This
kind of preprocessing is domain specific, but still flexible enough to adapt to changes and
extensions.
In the course of their analyses, the experts often want to derive variables of their own.
That way, they can materialize concepts otherwise spread over several other conditions.
This is an important point where they introduce case-specific background knowledge.
The system allows them to do so, up to the full expressiveness of mathematical formulas.
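For illustration, such a derived variable might look as follows in a pandas-style sketch; the attribute names and the threshold are invented and purely hypothetical.

```python
import pandas as pd

# One row per vehicle with the extracted influence variables (toy data).
df = pd.DataFrame({"Opt_Trailer": ["yes", "no", "yes"],
                   "AvgPayloadTons": [24.0, 8.5, 12.0]})

# A derived variable materializing a concept otherwise spread over several
# conditions; attribute names and the threshold are invented for illustration.
df["HighLoadUsage"] = ((df["Opt_Trailer"] == "yes")
                       & (df["AvgPayloadTons"] > 20)).astype(int)
```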
A similar fashion of multi-level complexity is offered for the “Explore” box in Figure 7: the system offers everything from standard reports, which suit the experts' needs in most cases, up to individually configurable diagrams. For model induction, our tool currently offers three branches that interact and complement each other: decision trees, rule sets, and rule cubes.
5. Case Study
In this section we present a real-world case study to give an idea of our system and its
overall value. Imagine the following scenario: Several vehicles are brought to dealerships
because a lamp indicates an engine issue. Diagnostics information read from the engine
control module indicates some trouble with the exhaust system. Not knowing which part
exactly fails, dealers replace oxygen sensors on suspicion. Early warning systems for
warranty cost control show a significant increase in warranty claims for these sensors.
Quality engineers get alerted, but cannot find an explanation for the issue: The replaced
sensors are inspected and are okay. Yet, no other part seems to have failed.
The data analyst knows that only one engine type can set the fault code. Therefore he
restricts the data set to all instances with Opt_Engine=E. Now, the system shows a ranked
list of many possible influences (e.g. Opt_Emission, Mileage, BusinessCenter), confirm-
ing the engineer’s assumption that all service claims are related to the CARB states emis-
sions system (Figure 8).2 Note that although weighted lift ranks Opt_Emission highest,
other measures like information gain, gain ratio, or gini would not. The engineer’s prior
expectation that this variable is highly relevant makes it the preferable choice. The re-
sulting tree is depicted in Figure 9.
After restricting the data set to vehicles with Opt_Emission=N the analyst expects that State will no longer show up. However, State remains a high-ranked influence, and
the north-eastern CARB states, especially New York, have extraordinarily high failure
rates. In our tool this is illustrated by a map that shows lift and recall for each state on a
color range from green to red (Figure 10(a)). Based on this surprising result, he derives
a new attribute CARBStates that separates the north-eastern CARB states from California. The
analyst observes that most service events occur in CARB states although some vehicles
sold to CARB states run in other states. It is interesting that north-eastern CARB states still
show a much higher failure rate than California. A possible explanation for this could be
the stop-and-go traffic in New York.
Apart from the new attribute CARBStates, the attribute Min7Temp is ranked quite
high. The split preview in Figure 11 shows that the failure primarily occurs when the
minimum temperature within one week before the repair date was low.
Now the question arises whether this is due to the fact that the minimum temperature is lower in New York than in California, whether failures primarily occurred during the winter months, or whether temperature is the true influence. The strength of rule cubes
2 The term CARB states (California Air Resources Board) refers to five US states that share very strict
emission laws: CA, NY, MA, ME, VT.
Figure 10. Figure 10(a) shows the distribution of failures after restricting instances to Opt_Engine=E and Opt_Emission=N on a U.S. map (dark colors indicate a high lift and a high recall). The chart in Figure 10(b) illustrates that most vehicles with this engine type and the specific emission standard are driven in California.
Figure 11. Split preview for the variable Min7Temp. The temperature chart in Figure 11(a) and the decision
tree preview in Figure 11(b) indicate that failures are more likely to occur at low temperatures. In the chart, the
height of the bars visualizes the number of vehicles in total, while the bar color encodes the lift. The shaded
area highlights the split borders proposed by the split algorithm which can be adjusted manually.
Figure 12. Neighborhood of the influence CARBStates: In the three list boxes on the left, the tool suggests
potential causes, similar influences, and possibly caused influences, respectively. Min7Temp is listed among
the suggested causes. On selection of this variable the 2D-cube on the right provides details. It opposes the
CARBStates variable (horizontal axis) to the Min7Temp variable (vertical axis). Indeed, there are stronger
color (i.e., lift) differences over the Min7Temp axis than over the CARBStates axis in each row, and it is only
the statistical dependence (visible at the tile sizes) that makes the marginal at the bottom (the CARBStates
distribution) a seemingly strong influence.
Figure 13. Variable Min7Temp remains an influence after the elimination of variable CARBStates: There is a
strong color gradient from red (left, cold temperatures) to green (right, warm temperatures).
lies in their ability to help answer questions like these. As our tool allows exchanging
data and combining various data analysis methods, the engineer can apply rule cubes to
the data set. Now the engineer can examine the neighborhood of CARBStates as depicted
in Figure 12. The tool suggests that the high rank of CARBStates might have been caused
by weather conditions, e.g. the variable Min7Temp. As temperatures in California are higher than in the north-eastern states (clearly visible in Figure 12), the failure simply does not occur as often there.
To verify this, the engineer eliminates the influence CARBStates. CARBStates obvi-
ously gets rank 0, but Min7Temp is still ranked quite high. If, however, the user eliminates the influence Min7Temp, the influence CARBStates almost vanishes (Figure 14). Note that, despite this interaction, the cube CARBStates × Min7Temp is ranked low (1.95),
since the main influence is the temperature, and there is no interaction between the two that causes the issue.
Figure 14. Variable CARBStates is ranked quite low after the elimination of the influence Min7Temp; all three tiles are in yellowish colors, representing lift values near 1.
Based on these results the engineer actually finds out that there was a calibration
issue that sometimes caused the engine control module to set a diagnostic code when the
vehicle was driven at cold temperatures in wide open throttle mode.
6. Conclusion
Interactivity can close the gap: if a user is strongly involved in the process of building a model, the resulting model will not only maximize a statistically motivated scoring function, but also the user's personal importance score. Then, the posterior probability of the most likely hypothesis also contains the user's prior expectations.
Many approaches suggested in the literature turned out to be either too constrained or too complex to be offered without major adaptation. In such a setting, we consider it best to stick to simple methods, provide these in a way that is both flexible and understandable,
and settle on interactivity. We presented interactive decision trees, rule sets, and finally
interactive rule cubes as three data mining methods that fit our requirements best. A
special focus is on the notion of causality.
Commercially available tools proved too inflexible with regard to domain adaptation. As our users are not data mining experts, the application workflow must follow the users' workflows and not vice versa. There is a fixed process skeleton corresponding to the original workflow, and a regular user can just go through this process, while a power user gains more flexibility (and complexity) upon request. A real-world case study showed how our tool can be applied in practice.
References
[1] David J. Hand. Data mining—reaching beyond statistics. Research in Official Statistics, 2:5–17, 1998.
[2] Willi Klösgen. Applications and research problems of subgroup mining. In Proceedings of the Eleventh
International Symposium on Foundations of Intelligent Systems, 1999.
[3] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regres-
sion Trees. Chapman & Hall, 1984.
[4] G. V. Kass. An exploratory technique for investigating large quantities of categorical data. Applied
Statistics, 29:119–127, 1980.
[5] John Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[6] Markus Mueller, Christoph Schlieder, and Axel Blumenstock. Application of Bayesian partition models
in warranty data analysis. In Proceedings of the Ninth SIAM International Conference on Data Mining
(SDM) (accepted), 2009.
[7] Willi Klösgen. EXPLORA: A multipattern and multistrategy discovery assistant. In Advances in Know-
ledge Discovery and Data Mining, pages 249–271. American Association for Artificial Intelligence,
Menlo Park, CA, USA, 1996.
[8] Stefan Wrobel. An algorithm for multi-relational discovery of subgroups. In Proceedings of the First
European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD-97), 1997.
[9] Nada Lavrač, Peter A. Flach, Branko Kavšek, and Ljupčo Todorovski. Rule induction for subgroup
discovery with CN2-SD. In ECML/PKDD’02 Workshop on Integration and Collaboration Aspects of
Data Mining, Decision Support and Meta-Learning, 2002.
[10] Bing Liu, Minqing Hu, and Wynne Hsu. Multi-level organization and summarization of the discovered
rules. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, 2000.
[11] Xifeng Yan, Hong Cheng, Jiawei Han, and Dong Xin. Summarizing itemset patterns: a profile-based
approach. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, 2005.
[12] Kaidi Zhao, Bing Liu, Jeffrey Benkler, and Weimin Xiao. Opportunity map: Identifying causes of failure
– a deployed data mining system. In Proceedings of the Twelfth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pages 892–901, 2006.
[13] Martin Scholz. Knowledge-based sampling for subgroup discovery. In Lecture Notes in Computer
Science, volume 3539, pages 171–189, 2005.
[14] Axel Blumenstock, Franz Schweiggert, Markus Müller, and Carsten Lanquillon. Rule cubes for causal
investigations. Knowledge and Information Systems, 2008.
[15] Wikimedia. Map of usa with state names. http://en.wikipedia.org/wiki/File:Map_of_
USA_with_state_names.svg, 2007. Last accessed: 2008-12-10.
Data Mining for Business Applications 35
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-35
Best Practices for Predictive Analytics
R. Domingos and T. Van de Merckt
VADIS Consulting
Introduction
The work that served as the motivation for this paper consisted of the development of systems to generate profitable sales leads for current customers and non-customers. These leads are the output of predictive models that are created following segmentations of the customer and non-customer databases. Due to the large number of predictive models often required to cover all the database segments, the application of “best practices” is necessary to render the exercise feasible.
Such a system, enabling the identification and usage of best practices, is the subject of the second part of the paper.
Predictive analytics has already been used for several years by corporations as a source of competitive advantage. The pioneers in this domain have been B2C businesses with a large base of consumers and the capacity to collect and store detailed socio-demographic and transactional data from their customers. Most of these corporations
have applied predictive analytics to learn about their customers but only a few have
done it to learn about their non-customers due to the difficulty of capturing reliable
data about non-customers. The proliferation of data providers and the evolution of the
1 Corresponding Author: Raul Domingos, VADIS Consulting, Researchdreef 65 Allée de la Recherche, 1070 Anderlecht, Belgium; E-mail: [email protected]
technical infrastructure to manage huge amounts of data may make things easier, but concerns about data privacy can be a major barrier to exploiting external data sources about consumers, especially in Europe, where data privacy regulation is more constraining than elsewhere [1].
This part of the article describes the business design of how predictive analytics can be used to generate the most profitable sales leads for both non-customers and customers of a financial services provider in a B2B setting. The business offer available in this context consists of products and services to attain a diversified set of goals, such as managing receivables, controlling financial risks, optimizing treasury, securing and financing international business, or transforming the company's capital structure.
The factors that determine business lead profitability differ somewhat between non-customers and customers. The first section describes in detail the factors that
influence the potential profitability of a business lead for non-customers. The second
section describes in detail what changes for each factor from non-customers to
customers.
The hypothesis that the non-customer may already have the offer with some competitor has to be considered together with the existence (or not) of a need for even more of that offer. This is equivalent to saying that the competitor may have an up-sell opportunity. The commercial interest of the situations depicted in the previous matrix is
not the same for all financial services.
A “Customer Migration” illustrates the scenario in which the non-customer already has the financial service with some competitor and does not have a need for more of that offer. The challenge for the sales representative is to convince the non-customer that it is interesting to migrate the financial service from the competitor to the sales representative's Bank. There is only a realistic commercial opportunity if there are no big business barriers to that migration. For instance, the migration of a short term credit is easier than that of a long term credit.
A “Competitive Opportunity” illustrates the scenario in which the non-customer already has the financial service with some competitor but still has the need for more of that offer. The sales representative will want to convince the non-customer to migrate the financial service to the sales representative's Bank, as in the previous case, but now combined with an up-sell opportunity.
An “Unattended Opportunity” is probably the most interesting scenario for a sales
representative. The non-customer has a need for a new financial service.
If the non-customer has neither the service nor the need to acquire it, there is indeed “No Opportunity”.
To estimate the location of each non-customer in the matrix of commercial scenarios, there are two modelling approaches that will help to align the non-customer according to each of the matrix dimensions:
of a P2B allows discriminating between the left and the right part of the
matrix.
Both models together will allow spotting opportunities across the whole matrix.
Estimating whether or not a company already has a financial service in a specific period is always possible using information such as the company's financial data. For instance, companies must declare explicitly the liabilities they have towards financial institutions in their balance sheets. However, predicting the need that a company may have for some financial service is not always viable. Since this prediction is about a specific moment in the company's lifetime, the concept is not applicable to financial services that are recurrently acquired by a company without the opportunity for upgrade or up-sell (e.g. tax credit). The condition for applying this dimension is the ability to identify two consecutive periods in the company's lifetime, where the company does not have the service in the first period and has the service in the second period (e.g. long term credit). Alternatively, the company may already have the service in the first period but acquire more of the same service in the second period. These concepts are illustrated in Figure 2, “Time Windows”.
If the data used in the predictive exercise is on a yearly basis (like the company
financial data), these periods have to be yearly as well.
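A hypothetical labelling rule for this “need” dimension, sketched in Python under the assumption that the yearly amount of the service held by a company is available, could be:

```python
def label_need(amount_first_period, amount_second_period):
    """A company is a positive example for the 'need' dimension if it did not
    have the service in the first period and has it in the second, or if it
    acquired more of the same service in the second period."""
    if amount_first_period == 0 and amount_second_period > 0:
        return 1          # newly acquired the service
    if 0 < amount_first_period < amount_second_period:
        return 1          # acquired more of the same service (up-sell)
    return 0
```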
For each financial service, the following questions can be used to determine which
modeling approaches (P2H; P2B) to calculate:
• Are there high business migration barriers to changing the financial service between two banks?
In a scenario of “Customer Migration”, the value estimation of the overall new customer relation is more important for selecting the best non-customers to “steal” from the competition. This normally requires higher investments in the relationship that must be justified with high expectations of future value generation.
In a scenario of “Unattended Opportunity”, the lead prioritization should be based on the value estimation of the offer, since the acquisition is based on an existing need not yet addressed by the competition. The value linked to this current need should be easier to realize than any other value expectation dependent on a positive development of a future business relationship.
In a scenario of “Competitive Opportunity” both value estimations should be taken
into consideration.
There are at least two approaches to define the business value to be estimated:
• The value that the new customer will generate is based on the sales representative Bank's best practices. This means that the commercial modus
operandi stays the same.
• The value that the new customer will generate is based on “perfect” practices.
This is equivalent to the total customer wallet estimation.
If the Bank doesn’t intend to dramatically improve its commercial best practices,
than the first approach is advised for non-customers. The major technical differences of
implementing one or another approach are further explained in the second part of this
article.
Inactive customers are companies that acquired some financial service in the past but for the last 12 months have not engaged in any kind of interaction with their Bank, and for which the business value generated was null. These customers can indeed be treated just like any non-customer for commercial “hunting” purposes. Of course, the sales representative should be aware of the existing history with that company when approaching it again.
Active customers that generate a negative value must be further analyzed to
understand the reasons that drive them to be unprofitable. Those customers that are generating losses due to costs associated with defaults on credit payments or any other kind of delinquent behavior should be excluded from any marketing approach to maximize
customer profitability. Those customers that generate losses because of administrative
costs (e.g. intensive usage of costly channels such as branches for low value
operations) should be considered an opportunity for business value improvement just
like any other active customer that already generates a positive value.
Just as for non-customers, there are the same two possible approaches to define the
business value to be estimated. They are repeated next:
• The value that the new customer will generate is based on the sales representative Bank's best practices. This means that the commercial modus
operandi stays the same.
• The value that the new customer will generate is based on “perfect” practices.
This is equivalent to the total customer wallet estimation.
The creation of a value estimation based on the first approach is much less relevant
for customers. The best indicator of the future value that a financial services customer will generate based on “business as usual” is the past value generated by that customer.
The optimization of the profitability of customers is achieved by identifying those
customers where there is a potential to tap. The “Growth Value Matrix” in figure 5
illustrates a customer segmentation based on the current value and the growth potential.
The growth potential can be defined as the part of the customer wallet that is not yet captured by the Bank, expressed relative to the current value ((total customer wallet − current value) / current value). The total customer wallet estimation is the value model proposed for customers.
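The following Python sketch (ours; the quadrant thresholds are assumptions, not values taken from the text) computes the growth potential as defined above and assigns a customer to a quadrant of the Growth Value Matrix.

```python
def growth_potential(total_wallet, current_value):
    """Uncaptured customer wallet, expressed relative to the value currently
    captured by the Bank: (total wallet - current value) / current value."""
    return (total_wallet - current_value) / current_value

def growth_value_segment(current_value, total_wallet,
                         value_threshold, growth_threshold):
    """Assign a customer to a quadrant of the Growth Value Matrix (Figure 5);
    the two thresholds are assumptions made for this sketch."""
    high_value = current_value >= value_threshold
    high_growth = growth_potential(total_wallet, current_value) >= growth_threshold
    return {(True, True): "high value / high growth",
            (True, False): "high value / low growth",
            (False, True): "low value / high growth",
            (False, False): "low value / low growth"}[(high_value, high_growth)]

# Example: a customer worth 10 000 with an estimated total wallet of 40 000
# has a growth potential of 3.0 (three times the value captured so far).
```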
• Attrition Score: What is the expected lifetime of the business relationship with the Bank?
The estimated profitability of sales leads is only achieved if the customer stays around long enough doing business with the Bank. Despite the interesting profiles that some customers may have in terms of offer acceptance, business value estimation, and risk of bankruptcy, care must be taken not to invest too much into business relations that risk being terminated too soon by the customer.
Customers may terminate their relationship with the Bank either explicitly
(accounts and other products are closed) or silently (accounts and other products are no
longer used or are used to a minimum). Both types of attrition can be foreseen using
predictive models.
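As a sketch of how such attrition targets might be defined for model training (our own simplification; the activity threshold is an assumption, not taken from the chapter), consider:

```python
def attrition_label(products_closed, transactions_last_12m, min_activity=1):
    """'explicit' if accounts and products were closed, 'silent' if activity
    dropped to (near) zero, 'active' otherwise.  min_activity is an assumed
    threshold."""
    if products_closed:
        return "explicit"
    if transactions_last_12m < min_activity:
        return "silent"
    return "active"
```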
The knowledge about which customer relations have a high risk of attrition allows
us to act proactively instead of reacting when it may be too late to keep the customer
actively in the portfolio.
1.3. Summary
All the factors associated with the profitability of the sales leads can be estimated fairly well using the companies' financial information, which is public data. This is much more detailed information than any commercial company will ever be able to collect about personal consumers unless privacy laws are loosened. The publicly available data about companies is the equivalent of knowing the assets, liabilities, main sources of income, main expense types and family composition of personal consumers!
In the previous sections, the most important distinctive points on how to manage
non-customers and customers to maximize profitability were identified. The summary
of these distinctive characteristics is presented in table 1. “Customers versus Non
Customers”.
The translation of the business design just described into a solution based on predictive analytics requires the implementation of many models. The implementation and management of all these models require the adoption of effective methodologies supported by best practices. The identification and usage of these best practices is the subject of the next part of this paper.
Reading a sample of reports about predictive analytics projects, one realizes that predictive analytics is probably one of the domains with the greatest variation in how professionals solve similar problems. This is positive for problems for which best practices are unknown, but it can be a source of inefficiency for well-known problems.
In the context of this paper, the concept of "best practices" means the body of knowledge that is applicable to most instances of the same problem most of the time and whose application can enhance the chances of successfully solving those problems.
We can speculate about some of the reasons that are behind this behavior:
• The amount of research in the domain is quite high, which translates into a diversity of available analytical methodologies. These methodologies define how the data is prepared and which families of learning algorithms are employed.
• The lack of a body of knowledge formally described in a single book containing the set of principles that would be generally accepted as best practices by all practitioners (the "laws" of predictive analytics).
• The hybrid nature of predictive analytics, blending different fields of science (e.g. machine learning, statistics, mathematics) with business.
Another factor that certainly contributes to this state of affairs is the lack of predictive analytics systems on the market with functionalities that strike the right balance between exploitation of best practices and exploration of new approaches. Such systems could help the novice practitioner avoid fatal errors and help the experienced practitioner focus on open issues, with the system giving precise guidance for recurrent issues without any extra distraction. Most of the predictive analytics systems on the market are in the form of a workbench [3]. These systems typically provide a broad, open-ended set of data preparation and modeling tools: this is good for practitioners who need to explore different approaches to solve sophisticated problems, but too open-ended to efficiently tackle classic predictive analysis. The following section will describe the different predictive models
required to deliver the vision explained in the first part of the article. We will see that
there are different levels of complexity in this set of predictive models.
Let us take as an example the different predictive models required to deliver the predictive analyses for B2B described in the first part of this article. Considering 10 distinct financial services provided by a Bank, we would need to develop a total of about 65 models, and we are assuming here that there would not be any pre-modelling segmentation of either non-customers or customers. A possible segmentation could be to split companies into small, medium and large companies. This would increase even further the number of models required to deliver the business design described in the first part of this article. The number of models needed to give a good picture
of the bankruptcy and attrition risk might be more than 2 but for the sake of simplicity
we consider here only 2 models.
Predictive models can be classified according to many criteria. For the purposes of
best practices identification, we will classify the 65 models according to the nature of
the target variable: the target variable can either be a binary variable or a continuous
numerical variable. The 65 models can be divided into three levels of complexity:
• There are 43 models with a binary target variable. These are the 20 P2H, the
20 P2B and the models to estimate the risk of bankruptcy and attrition.
• There are 11 models with a continuous numerical target variable for non-
customers. These are the 10 models to estimate the value that will be
generated through a specific financial service plus 1 model to estimate the
overall value generated by the non-customer.
• There are another 11 models with a continuous numerical target variable but
for customers.
The models that are the easiest to create are the 43 binary models. There are several bibliographical references (e.g. [4]) about the creation of binary models. Many recipes can be found for the problems related to the creation of a binary model, ranging from feature selection to the avoidance of overfitting.
The continuous models try to estimate the value that would be generated if the Bank could capture the customer's total wallet. This wallet can be shared by different banks, and there can even be a part of the wallet that is simply not reached by any bank (one can argue that this last case only happens at companies with lazy chief financial officers!).
The fact that the numerical value to be estimated is unknown (the total wallet size) makes the development of the continuous models particularly difficult. For the binary models, the target concerns the ownership (or not) of a product with the Bank; this is an event that can be precisely identified from the internal data.
The aim of this part of the article is not to delve into all the details of sophisticated problems such as the estimation of share of wallet, but to expand on the best practices used to solve classical problems such as the creation of a binary model. However, the problem of share-of-wallet estimation is explained further here to better illustrate situations where innovation should set the tone instead of the application of best practices.
Basically, the total wallet size of a company for financial services is the total amount of money that one could expect the company to either invest in financial assets or borrow from Banks over a certain period (e.g. a year). Since this amount is not readily observable, a possible alternative is to ask companies directly for this amount, for example through a survey. There is more than one obstacle to completing this task successfully (not to mention the difficulty of convincing companies to answer this kind of inquiry):
• reaching the right people at each company who will actually be able to identify the right answer;
• making sure that the answer is not related to just a part of the company but really to the entire company;
• making sure that every company understands the survey questions in the same way, so that we can be sure we are comparing apples with apples across the survey replies.
A more analytical (and feasible) approach is to consider that the total wallet size of a certain company should be no less than the amounts that "similar" companies are already spending with the Bank. The trick here is to get the part about deciding who is similar to whom right. It so happens that there is an analytical tool that gives exactly this information: quantile regression. Just as a classical regression analysis estimates the average of the distribution of the dependent variable for observations that lie close to one another in the observation space, quantile regression does exactly the same but, instead of the average, estimates a precise quantile of our choice. For instance, if we decide to estimate the 90th percentile, this is equivalent to considering the total wallet size of a company to be the observation point found at the 90th percentile of the distribution of the actual values for "similar" companies, as identified by the quantile regression model itself.
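As a concrete illustration of the 90th-percentile idea, here is a minimal sketch using the QuantReg class of the statsmodels package; the company features, the synthetic data and the choice of regressors are hypothetical, and the chapter does not prescribe this particular implementation.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical company features (e.g. from public financial statements)
# and the amount each company currently spends with the Bank.
rng = np.random.default_rng(0)
n = 500
companies = pd.DataFrame({
    "turnover": rng.lognormal(mean=15, sigma=1, size=n),
    "employees": rng.integers(5, 5000, size=n),
})
spend_with_bank = 0.02 * companies["turnover"] * rng.uniform(0.1, 1.0, size=n)

# Quantile regression of current spend on (log-scaled) company characteristics.
X = sm.add_constant(np.log(companies))
model = sm.QuantReg(spend_with_bank, X)
fit_90 = model.fit(q=0.9)  # 90th percentile instead of the conditional mean

# The fitted 90th percentile for a company is used as a proxy for its total
# wallet: what "similar" companies at the top of the distribution already spend.
estimated_wallet = fit_90.predict(X)
untapped = np.maximum(estimated_wallet - spend_with_bank, 0)
print(float(np.mean(untapped)))
```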
One detailed example of the application of quantile regression to estimate the share of wallet can be found in [5]. The technique of quantile regression is fairly unknown to the mainstream analyst but is by no means a recent invention. As Tukey put it back in 1977 [6]:
“What the regression curve does is give a grand summary for the averages of the
distributions corresponding to the set of x’s. We could go further and compute several
different regression curves corresponding to the various percentage points of the
distributions and thus get a more complete picture of the set. Ordinarily this is not
done, and so regression often gives a rather incomplete picture. Just as the mean gives
an incomplete picture of a single distribution, so the regression curve gives a
corresponding incomplete picture for a set of distributions.”
The problem at hand consists in developing binary models about the ownership (or not) of a financial product. This is a common task among data mining projects. It should be executed applying best practices instead of re-inventing the process of creating a binary model.
Virtually any analytical assignment can be decomposed into four main phases. The problem understanding and the analytical solution design are the critical and distinguishing phases of an analytical assignment. As the motto goes, the right answer to the wrong question is no better than a plain wrong answer to the right question.
What is meant by predictive analytics best practices has to do with the implementation process. At this point of the assignment, the analyst typically has a flat data file as input, and a binary model is the expected output. Most analysts reading this text will identify with the following set of questions:
• How can I select the right variables to create my model when my flat data file
contains thousands of input variables?
• Should I treat the outliers in any special way? How?
• What about missing values? Are they going to affect my model if I do nothing
about them?
• Which learning algorithm should I use? Should I use different techniques
through an ensemble model approach?
• Depending on the learning algorithm that I am going to use, which statistical data assumptions should I validate before I employ it?
• How should I use the available data in the flat file to avoid the overfitting
trap?
• How should I use the available data in the flat data file to make sure that I will
be able to correctly evaluate the generalization capacity of my final model?
• Do I have enough positive cases to really learn some patterns or do I need to
do something to take care of this?
• Assuming that my learning algorithm of choice has parameters to fine tune,
which settings should I use?
• Exactly which evaluation measures will give me the most relevant insight
about the quality of my model?
• Is it OK to use my model output as it comes, or do I need something else, like a probability?
Why is it necessary to address all these questions over and over again on each analytical assignment and devise a solution from scratch? The time spent in this way would have a much better pay-off if applied to the define and design phases instead. The best practices should point to a set of analytical approaches that make it possible to flawlessly achieve a certain task within a limited amount of time. What we advocate is exactly a predictive analytics system that receives as input the flat data file, together with the identification of which variable is the target and which variables are possible inputs, and delivers a binary model as output.
Such a predictive analytics system takes care of all the questions listed above using embedded best practices that, applied together, can make a huge difference in the quality of the final model produced.
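To make the idea tangible, the sketch below shows, in Python with scikit-learn, what such an automated binary-modelling routine might look like; it addresses several of the questions listed above (missing values, variable selection, rare positives, overfitting, evaluation, probabilities), but it is only a generic illustration under assumed defaults, not the system the authors actually used.

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import StratifiedKFold, cross_val_score

def build_binary_model(X, y, n_features=50):
    """Illustrative sketch: train a calibrated binary classifier from a flat data file.

    X: 2-D array or DataFrame of candidate input variables.
    y: binary target variable.
    """
    base = Pipeline([
        ("impute", SimpleImputer(strategy="median")),        # missing values
        ("scale", StandardScaler()),                          # put variables on one scale
        ("select", SelectKBest(f_classif, k=min(n_features, X.shape[1]))),  # variable selection
        ("clf", LogisticRegression(max_iter=1000, class_weight="balanced")),  # rare positives
    ])

    # Stratified cross-validation guards against the overfitting trap and gives
    # an honest estimate of generalization; AUC is threshold-independent.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    auc = cross_val_score(base, X, y, cv=cv, scoring="roc_auc").mean()

    # Calibrate so that the output can be read as a probability.
    model = CalibratedClassifierCV(base, cv=cv).fit(X, y)
    return model, auc
```

The point is not this particular choice of components but that the recurrent questions are answered once, inside the routine, so the analyst can spend time on problem definition and solution design instead.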
While this system cannot guarantee the best possible model in every scenario, quite a good model can be achieved in a fairly short time. The system was used in the PAKDD 2007 competition, achieving 6th place [7]. The total effort invested by the analyst in this competition was one man-day, from reading the contest rules and understanding the problem and the data to producing the final model reported.
The curious reader may find more information on these best practices in [8]. Equipped with such a system, the task of creating 43 binary models within a project lifetime, with flat data files on the order of thousands of variables and hundreds of thousands of rows, does not look so daunting.
3. Conclusion
This text has described the potential of using predictive analytics in a B2B setting.
There is much more data available about companies than predictive analytics
References
[1] The European Commission home page for data protection, as of May 2008. http://ec.europa.eu/justice_home/fsj/privacy/index_en.htm
[2] Web site of the Bank for International Settlements, as of May 2008. http://www.bis.org/
[3] Poll about data mining software used in real projects, as of May 2008. http://www.kdnuggets.com/polls/2008/data-mining-software-tools-used.htm
[4] P.N. Tan, M. Steinbach, V. Kumar. Introduction to Data Mining. Addison-Wesley Longman Publishing Co., Inc., 2005.
[5] S. Rosset, C. Perlich, B. Zadrozny, S. Merugu, S. Weiss, and R. Lawrence. Wallet estimation models. In International Workshop on Customer Relationship Management: Data Mining Meets Marketing, 2005.
[6] F. Mosteller and J. Tukey. Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley, Reading, Mass., 1977.
[7] Web site of the PAKDD 2007 Data Mining Competition, as of May 2008. http://lamda.nju.edu.cn/conf/pakdd07/dmc07/index.htm
[8] T. Van de Merckt and J.F. Chevalier. PAKDD-2007: A Near-Linear Model for the Cross-Selling Problem. International Journal of Data Warehousing and Mining, Vol. 4, Issue 2, 2008.
Data Mining for Business Applications 49
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-49

Towards the Generic Framework for Utility Considerations
S. Puuronen and M. Pechenizkiy
Abstract. Rigorous data mining (DM) research has successfully developed advanced data mining techniques and algorithms, and many organizations have great expectations of deriving more benefit from their vast data warehouses in decision making. Even though there are some success stories, the current status in practice mainly consists of great expectations that have not yet been fulfilled. DM researchers have recently become interested in utility-based DM (UBDM), starting to consider some of the economic utility factors (such as cost of data, cost of measurement, cost of class labels and so forth), but many other utility factors are still left outside the main directions of UBDM. The goal of this position paper is (1) to motivate researchers to consider utility from a broader perspective than is usually done in the UBDM context, and (2) to introduce a new generic framework for these broader utility considerations in DM research. Besides describing our multi-criteria utility-based framework (MCUF), we present a few hypothetical examples showing how the framework might be used to consider the utilities of some potential DM research stakeholders.
Keywords. utility-based data mining, data mining stakeholders, rigor vs. relevance
in research
Copyright © 2010. IOS Press, Incorporated. All rights reserved.
Introduction
Nowadays, the rapid growth of IT has brought tremendous opportunities for data collection, data sharing, data integration, and intelligent data analysis across multiple (potentially distributed and heterogeneous) data sources. Since the 1990s, business intelligence has played an increasing role in many organizations. Data warehousing and data mining (DM) are becoming more and more popular tools to facilitate knowledge discovery and contribute to decision making.
Yet, DM is still a technology burdened with great expectations of enabling organizations to take more benefit of their huge databases. There exist some success stories in which organizations have managed to gain competitive advantage from DM. Still, the strong focus of most DM researchers on technology-oriented topics does not support expanding the scope towards less rigorous but practically very relevant research topics. The current situation in DM has similarities with situations during the development of some other informa-
1 Corresponding Author: Department of Computer Science, Eindhoven University of Technology,
P.O. Box 513, 5600 MB Eindhoven, the Netherlands; E-mail: [email protected].
tion technology (IT)-related sub-areas earlier. Research in the Information Systems (IS) discipline (one of those IT-related sub-areas) has a strong tradition of taking into account the human and organizational aspects of systems besides the technical ones.
We have suggested a provocative discussion on why DM does not contribute to business [30] and emphasized further in [32] that the user- and organization-related research results and organizational settings used in the IS discipline include essential points of view which it might be reasonable to take into account when developing DM research towards practically more relevant directions in domain areas where human and organizational matters count. Like IS research, DM research has several stakeholders, the majority of which can be divided into internal and external ones, each having their own and commonly conflicting goals. Currently, DM researchers rarely take industry (the most important external stakeholder) into account while conducting their often rigorous research activities. This holds even in the industry context, where the meaning, design, use, and structure of a DM artifact is an important topic. The situation is further complicated because outputs vary significantly by industry, affecting the meaning and measurement of utility and performance.
Although recent developments in cost-sensitive learning and active learning have started to consider some of the economic utility factors (such as cost of data, cost of measurement, cost of class labels and so forth), many other utility factors are left outside the main directions of the emerging utility-based DM research (UBDM).
For us, DM is inseparably included as an essential part of the knowledge discovery process, and we see that a more holistic view of DM research is needed. If we, as DM researchers, want to participate in this kind of research effort, then we also need to take utility-related topics under investigation. Simple assessment measures like predictive accuracy have to give way to economic utility measures, such as profitability and return on investment. On the other hand, DM systems have their own peculiarities as IS systems, which should also be taken into account in the holistic view of DM systems research.
Thus, the goal of this paper is (1) to motivate DM researchers to consider the possibility of taking utility aspects into account from a broader perspective than is usually done in the UBDM context nowadays, and (2) to introduce a generic framework for utility considerations in DM research, with a few examples from the point of view of some hypothetical DM research stakeholders.
Fayyad in [16] defines knowledge discovery from databases (KDD) as "the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable
patterns in data". Before focusing on the discussion of the DM process, we would like to note that this definition by Fayyad is very capacious: it gives an idea of what the goal of DM is, and in fact it is cited in the introductory sections of many DM-related papers. However, in many cases those papers have nothing to do with the novelty, interestingness, potential usefulness or validity of the patterns which were discovered or could be discovered using the DM techniques proposed later in those papers.
The DM process comprises many steps, which involve data selection, data preprocessing, data transformation, search for patterns, and interpretation and evaluation of patterns [16]. These steps start with the raw data and finish with the extracted knowledge acquired as a result of the whole DM/KDD process. The set of DM tasks used to extract and verify patterns in data is the core of the process. Most current DM/KDD2 research is dedicated to pattern mining algorithms and to descriptive and predictive modeling of data. Nevertheless, this core process of searching for potentially useful patterns typically takes only a small part (estimated at 15%–25%) of the effort of the overall KDD process. The additional steps of the KDD process, such as data preparation, data selection, data cleaning, incorporating appropriate prior knowledge, and proper interpretation of the results of mining, are also essential for deriving useful knowledge from data.
The life cycle of a DM project according to the CRISP-DM model (Figure 1) consists of six phases (though the sequence of the phases is not strict, and moving back and forth between different phases normally happens) [8]. The arrows indicate the most important and frequent dependencies between phases, and the outer circle in the figure denotes the cyclic nature of DM – a DM process continues after a solution has been deployed. If some lessons are learnt during the process, new and likely more focused business questions can be recognized, and subsequently new DM processes will be launched.
CRISP-DM overlaps considerably with Fayyad's view. However, we would like to emphasize that the DM process is now explicitly put into a business context, represented by the business understanding and deployment blocks.
For us it is natural to define utility in UBDM as a measure of overall (e.g. economic) benefit, and thus the concept of utility should consequently be connected with the entire DM process. In the rest of this section we review UBDM research directions and summarize the steps of the CRISP-DM process which they seem to take into account.
Considerations of costs and benefits are common to all managerial decisions in organizations. Consequently, the quality of a DM artifact and its output must be evaluated with respect to its ability to enhance the quality of the resulting decision. Most of the early work in predictive DM did not address the different practical issues related to data preparation, model induction and its evaluation and application. Cost-sensitive learning [38] research initially emerged in DM as an effort to reflect the relevance of incorporating the costs resulting from a decision (based on the prediction of a DM model). Many application areas of DM suggested that, e.g. for classification, it is not enough to treat the benefit of a classifier as simply proportional to the number of accurately predicted instances; one also needs to account for the asymmetric costs associated with true versus false prediction of positives and negatives.
2 We would like to clarify that, according to Fayyad's definition and some other research literature, DM is commonly referred to as a particular phase of the entire process of turning raw data into valuable knowledge, and it includes the application of modeling and discovery algorithms. In industry, however, both the knowledge discovery and DM terms are often used as synonyms for the entire process of producing valuable knowledge from data.
Figure 1. CRISP-DM: CRoss Industry Standard Process for Data Mining [8]
The knowledge of this asymmetry can be used to guide the parameterization of a classifier and the selection of the most appropriate one (e.g. MetaCost [14] or cost-sensitive boosting [15]). This has led to the development of robust evaluation techniques like the ROC convex hull method [31] or the area under the ROC curve (AUC) [6], which can be utilized when considering the business problem and managerial objectives.
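As a small illustration of how asymmetric misclassification costs can guide the parameterization of a classifier, the sketch below derives a cost-minimizing decision threshold from hypothetical false-positive and false-negative costs; this is a generic textbook construction, not a reproduction of MetaCost [14] or cost-sensitive boosting [15].

```python
import numpy as np

# Hypothetical asymmetric costs; correct decisions are assumed to cost nothing.
C_FP = 1.0    # cost of predicting positive when the true class is negative
C_FN = 10.0   # cost of predicting negative when the true class is positive

# Predicting positive is the cheaper action whenever p * C_FN >= (1 - p) * C_FP,
# i.e. above this threshold on the predicted probability of the positive class.
threshold = C_FP / (C_FP + C_FN)

def decide(p_positive: np.ndarray) -> np.ndarray:
    """Map calibrated probabilities to cost-minimizing decisions."""
    return (p_positive >= threshold).astype(int)

def expected_cost(p_positive: np.ndarray, decisions: np.ndarray) -> float:
    """Average expected misclassification cost of a set of decisions."""
    cost = np.where(decisions == 1, (1 - p_positive) * C_FP, p_positive * C_FN)
    return float(cost.mean())

probs = np.array([0.02, 0.08, 0.15, 0.60])
print(threshold)                              # 1/11, i.e. roughly 0.09
print(decide(probs))                          # [0 0 1 1]
print(expected_cost(probs, decide(probs)))
```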
As DM commonly refers to secondary data analysis, it is often assumed that a fixed amount of training data (collected for some other purpose) is available for the current goal of knowledge discovery. Consequently, many developers of DM techniques assume that the data is given and that there are no costs associated with its availability. However, sooner or later it becomes evident that the availability of data for analysis (and especially the availability of labeled data for supervised learning) affects the economic utility, so the cost of acquiring training data (or labeling unlabeled data) should be considered among the costs of building a model and applying the model.
Thus, e.g. in medical diagnostics, the general problem can be formulated as follows: given the costs of tests and a fixed total budget, decide which tests to run on which patients to obtain the additional information needed to produce an effective classifier (assuming that no or little training data is available initially) [18]. The cost consideration then includes the costs associated with building the classifier and the costs (and benefits) associated with applying the classifier.
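A minimal sketch of this budgeted-acquisition idea is given below: a classifier is trained on a small labeled seed set and then repeatedly "buys" the label of the most uncertain case until a fixed budget is spent (plain uncertainty sampling). The data, the per-label cost and the stopping rule are hypothetical, and [18] describes a more elaborate test-selection strategy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical patient data: two test results per patient, binary diagnosis.
X = rng.normal(size=(300, 2))
score = X[:, 0] + 0.5 * X[:, 1]
y = (score > np.median(score)).astype(int)

# Small labeled seed set containing both classes; the rest is an unlabeled pool.
seed = np.r_[np.where(y == 0)[0][:5], np.where(y == 1)[0][:5]]
labeled = [int(i) for i in seed]
unlabeled = [i for i in range(300) if i not in set(labeled)]
budget, cost_per_label = 50.0, 2.0   # fixed budget and assumed cost of one label

while budget >= cost_per_label and unlabeled:
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    # Query the unlabeled case the current model is least certain about.
    probs = clf.predict_proba(X[unlabeled])[:, 1]
    pick = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]
    labeled.append(pick)             # "pay" for this label
    unlabeled.remove(pick)
    budget -= cost_per_label

print(len(labeled), "labels acquired within budget")
```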
Evaluation of cost-sensitive learners was studied in [22], which introduced cost curves that enable easy visualization of average performance (expected cost), operating range, confidence intervals on performance, and differences in performance and their significance.
Thus, it seems that most of the current research in UBDM leans towards the cost-sensitive learning and active learning paradigms from a machine learning perspective, and treats total utility as derived from the following DM-related processes:
data preparation3 – costs of acquiring the data, including primarily the costs of (1) measuring an attribute value, (2) data labeling for supervised learning, (3) data record collection/purchase/retrieval, and (4) data cleaning and preprocessing;
data modeling and evaluation – costs of searching for patterns in the data, costs of misclassification, and benefits of using the discovered patterns/models4.
Thus, the deployment- and impact-estimation (use-oriented) steps of KDD are currently almost completely ignored in (UBDM) DM research.
Even if UBDM researchers say that the goal of UBDM is to act so as to maximize the total benefit of using the mined knowledge minus the costs of acquiring and mining the data, this does not imply a thorough analysis of the use-oriented steps of the KDD process, nor accounting for the various benefits (and risks) associated with them. The mined knowledge is of utility to a person or an organization if it contributes to reaching a desired goal. Utility-based measures in itemset mining use the utilities of the patterns to reflect the user's goals. Yao et al. [41] review utility-based measures for itemset mining and present a unified framework for incorporating several utility-based measures into the DM process by defining a unified utility function. The objective and subjective interestingness of the results of association analysis has been studied by several authors from different perspectives [25][7][27][33]. Yet, it is always assumed that there is a (single type of) user and that the user is able to clearly formulate business challenges, to help find an appropriate transformation of them into a set of DM tasks, or to simply pick one of the suggested solutions.
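For concreteness, one common way to make the utility of an itemset reflect a user's (e.g. profit) goals is to weight each item's quantity in a transaction by an external utility; the sketch below illustrates that generic idea with invented figures and is not the specific unified utility function of Yao et al. [41].

```python
# Hypothetical transactions: item -> quantity purchased.
transactions = [
    {"A": 2, "B": 1},
    {"A": 1, "C": 4},
    {"B": 3, "C": 1},
]
# External utility: profit per unit of each item (hypothetical figures).
profit = {"A": 5.0, "B": 2.0, "C": 1.0}

def itemset_utility(itemset, transactions, profit):
    """Sum of quantity * profit over all transactions containing the itemset."""
    total = 0.0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * profit[item] for item in itemset)
    return total

print(itemset_utility({"A"}, transactions, profit))       # 2*5 + 1*5 = 15.0
print(itemset_utility({"A", "B"}, transactions, profit))  # 2*5 + 1*2 = 12.0
```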
In the following section we consider different types/levels of DM use and DM stake-
holders emphasizing the differences in utility considerations depending on the type of
DM use or the type of DM stakeholder.
have an effect on the impacts. In the CRISP-DM model considered in the previous section, the starting point is set as business understanding [8]. A closer look at that part of the model is represented in Figure 2 and discussed below. The initial phase of the model focuses first on understanding the objectives and requirements of the DM project from a business perspective. After that, the project is considered in detail as a DM problem and a preliminary plan is designed to achieve the objectives.
In the CRISP-DM model, the first phase, which aims at thoroughly understanding the user's true business needs, is a very challenging one. Often users have many competing and even conflicting objectives, which need to be uncovered and balanced to the greatest possible extent at the very beginning of the modeling. Recognizing this need is a good starting point, but, as is well known in the IS field, the needs of users are commonly hard to discover. In the CRISP-DM model the next phases are to describe the customer's primary
3 The data preparation step can be associated not only with the data preprocessing, data selection and data transformation processes, but also with data collection, data acquisition and/or data labeling.
4 Currently this direction is limited to accounting for some economic benefits, known (or believed to be known) in advance, without estimating the actual individual or organizational impact of using the DM artifact.
objective and the criteria for a successful or useful outcome of the project from a business perspective. These criteria commonly include things to be evaluated subjectively, but they may also include some aspects which can even be measured objectively.
In the CRISP-DM model, the next phase includes a deep evaluation of the starting situation, covering many aspects. In addition to the possible assumptions that have been made about the data to be used during DM, the possible key assumptions about the business are also important, especially if they are related to conditions on the validity of the results and to legal aspects beside the ordinary ones (such as available resources, constraints, schedule, and security). A very important part of the evaluation is to consider the possible
The final phase of the business understanding step in the CRISP-DM model (Figure 2) is producing a project plan. Beside the typical elements of a project plan (such as the duration, required resources, inputs, outputs and dependencies of each step), this plan needs to include DM-specific elements (such as large-scale iterations of the modeling and evaluation phases with an evaluation strategy, and an analysis of the dependencies between the time schedule and the risks, with actions and recommendations in case the risks appear). The project plan also includes an initial selection of tools and techniques to be applied.
Besides giving normative advice for the DM process, the CRISP-DM model also mentions many important business-related utility aspects. We would like to stress here again that, because the majority of DM efforts omit utility considerations, most state-of-the-art DM artifacts do not allow searching directly for descriptive and predictive models by specifying desired utility-related parameters.
One typical approach employed in practice is to use feedback from domain experts on whether they find the DM artifact and what it outputs to be useful (insightful, applicable, actionable, transparent, understandable, etc.). The feedback is used to adjust the data preprocessing steps or the parameters of a data modeling technique (or the selection of a particular technique) until an acceptable solution satisfying the major expectations of the experts is found [29].
One way or another, there is a need to study what factors affect user acceptance of DM artifacts in general and of particular learned models. Pazzani [29] states that studying how people assimilate new knowledge could help the DM community design better KDD systems; in particular, instead of generating alternatives and testing them against utility-related criteria, KDD systems would bias the search toward models that meet these criteria. An interesting work in this direction is [3], where the authors tried to answer the question of what makes a discovered group difference insightful, approaching two concrete research questions, "Is a discriminative or characteristic approach more useful for describing group differences?" and "How do subjective and objective measures of rule interest relate to each other?", by conducting a study with users (domain experts). Unfortunately, such studies constitute only a marginal minority of research efforts in DM-related areas.
In the Information Systems (IS) discipline, IS success research has been practiced for quite a long time; for example, the most widely known DeLone and McLean Success Model [11] was based on a review of 180 earlier studies. It has since served as a reference model for many additional studies, and DeLone and McLean, in their ten-year survey [12], found more than one hundred research reports explicitly referencing the success model. The enhanced model is shown in Figure 3.
Bokhari [5] used meta-analysis, collecting a set of 55 papers from major journals, conference proceedings, books and dissertations within the period 1979 to 2000, and tried to explain and find empirical evidence about the relationship between system usage and user satisfaction. Both have long been considered part of the success metric, but the research related to the nature of their relationship has failed to reach agreement about its strength and nature [5]. There are, of course, also articles discussing the success factors of particular kinds of systems [34][23].
Perhaps the first efforts to consider the success factors of DM systems are those presented in the DM Review magazine [9][19], which include practice-based success-factor considerations.
Figure 3. Adapted from the D&M IS Success Model [11] (p. 87) and the updated D&M IS Success Model [12] (p. 24). (The components shown are System Quality, Information Quality and Service Quality, Use and User Satisfaction, and Individual and Organizational Impact.)
Coppock [9] analyzed the failure factors of DM-involved projects. He names four: (1) the persons in charge of the project did not formulate actionable insights, (2) the sponsors of the work did not communicate the insights derived to key constituents, (3) the results did not agree with institutional truths, and (4) the project never had a sponsor and champion. The main conclusion of Coppock's analysis is that, as in IS, leadership, communication skills and an understanding of the culture of the organization are no less important than the traditionally emphasized quality of data and the technological skills needed to turn data into insights.
Hermiz [19] communicated his belief that there are four critical success factors for DM projects: (1) having a clearly articulated business problem that needs to be solved and for which DM is a proper tool, (2) ensuring that the problem being pursued is supported by the right type of data of sufficient quality and in sufficient quantity for DM, (3) recognizing that DM is a process with many components and dependencies – the entire project cannot be "managed" in the traditional sense of the business word, and (4) planning to learn from the DM process regardless of the outcome, and clearly understanding that there is no guarantee that any given DM project will be successful.
Lin [40] notes that, in fact, no major impacts of DM on the business world have been echoed. However, even the reporting of existing success stories is important. Giraud-Carrier [17] reported a summary of 136 success stories of DM, covering 9 business areas, with references to 30 DM tools or DM vendors. Unfortunately, there was
the definition of "significant research" in the IS area. All participants strongly valued the influence that IS has outside the IS research society. Among the answers, some even wanted a very broad view, counting the solving of societal problems and the creation of wealth among significant research, while others were more concerned with issues related to the closer stakeholders around the IS field. In DM research we still have a lot to do to "take good care of our own backyard", as El Sawy [13] (p. 343) expressed his opinion about the key characteristics of the IS research that really matters. In the next section we continue broadening the utility considerations to the different groups of users, which have many different utility preferences.
To consider the different utility aspects of DM research, we first consider the possible understandings of the group of stakeholders of DM research. By a stakeholder of DM research we understand a person or an organization that has a legitimate interest in DM research or its results. We divide the stakeholders of DM research into two main groups: (1) internal stakeholders, i.e. stakeholders within academia, and (2) external stakeholders, i.e. all the others outside academia (as is usual in the IS discipline and as was suggested for DM research in [32]).
Related to IS research, some authors stress the need to recognize its stakeholders [21] (p. 249). As external stakeholders of publicly-funded IS research they mention the following: industry shareholders and their agents (management), the employees of firms
and organizations and their agents (unions), the community and other levels of government, and the general public. Beside external stakeholders, they refer to [4] in noting that IS researchers have important stakeholders within academia, such as funding agencies, colleagues in other disciplines, university administrators, and students.
When the panelists were discussing IS research that really matters, they were also asked their opinion about IS research stakeholders [13]. Several traditional aspects came up, such as the academic community (professional peers, students, journals which regulate and disseminate the publication of research, as well as academic research funding institutions), the business community (managers and professionals, including consultants, who use IS to manage, as well as those who design, build, and manage IS), and the non-profit organization community. There were also opinions favoring a broader interpretation of stakeholders, including all those who are affected by IS (all human beings and even the entire human race around the world, now and in the future), and concerns about the freedom to select, run, and publish IS research.
The main body of publicly funded DM research has concentrated on the development of new algorithms or their enhancements, and has left it to DM developers in domain areas to take into account, for example, the cost considerations: investment in research, product development, marketing, and product support (Lin in [40]). However, we have raised the questions, "Is it reasonable that DM researchers leave the study of the DM development and DM use processes totally outside their research area?" and "Are these equally important aspects going to be handled better by the researchers of other areas, so that DM researchers should also in the future concentrate on the technological aspects of DM only?" [32]. In any case it is evident that DM research has both external and internal stakeholders, as IS does, and DM researchers themselves need to decide which of the stakeholders and which of their utilities will be considered by DM researchers in the future.
After recognizing the stakeholders, it is necessary to consider what relevancy means for them. This is required in order to be able to consider their utilities with respect to research. As in IS research, the focus of [21] is on the most commonly espoused group, industry management, considering two subgroups: senior management and the practitioners in IS departments. The internal stakeholders also comprise two groups: IS research
sidering stakeholders' interests might still not be irrelevant. It is further noticed that different stakeholder groups tend to possess conflicting interests arising from their different value systems. Thus IS research relevancy depends on judgments that should be made explicit. Hirschheim and Klein [21] (pp. 250–253) have recognized that both the business community and the academic community have failed to justify their expectations about IS research. They blame the IS research community for having done a very poor job of communicating, and add that if IS researchers truly believe that their theories are relevant for practitioners, they should communicate their results better. On the other hand, "the view of IS held by IS-practitioners is at best only partially supported by some theories that guide IS research" (ibid. 253), and the view of non-IS practitioners is still "even more at odds".
In his introduction to the JAIS special section on IS research perspectives, former senior editor Straub recently made a reference in [36] to [1], where the authors classify knowledge sources into three broad dissemination types: (1) academic, (2) practitioner or professional, and (3) academic-practitioner. He further refers to [2] and [37], which have argued that academics differ from practitioners in their focus on conceptual clarity, concluding "that managers who value definitional clarity will need to seek out academic venues since little of this information [i.e. clear definitions of concepts] will be found elsewhere" [36] (p. 242).
We consider DM to be an inseparable, essential part of the knowledge discovery process, and we think that a more holistic view is needed in DM research. If this is accepted, DM researchers have to take more and more utility-related topics under investigation, on larger scales. Simple assessment measures like predictive accuracy have to give way to economic utility measures, such as profitability and return on investment, beside the narrower economic ones used in cost-sensitive learning. In the following we concentrate on only the most important stakeholders and suggest the use of a generic framework for utility considerations. This framework is based on the multi-attribute additive function [24] represented below:
$$V(a_i, w) = \sum_{j=1}^{m} w_j \, v_j(a_i), \qquad \text{where } \sum_{j=1}^{m} w_j = 1 \text{ and } w_j \ge 0. \qquad (1)$$
Here v_j(a_i) is the value of an alternative DM project (hereafter simply "alternative") a_i on criterion j, and w_j is the weight of that criterion when the global value V(a_i, w) of the alternative a_i is calculated. The weights indirectly reflect the importance of the criteria.
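As a minimal illustration of formula (1), the sketch below scores two hypothetical DM project alternatives on a few criteria whose single-criterion values v_j(a_i) are assumed to be already scaled to [0, 1]; the criteria names and weights are invented for the example.

```python
def additive_utility(values, weights):
    """Global value V(a_i, w) = sum_j w_j * v_j(a_i), with sum_j w_j = 1 and w_j >= 0."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9 and all(w >= 0 for w in weights.values())
    return sum(weights[c] * values[c] for c in weights)

# Hypothetical criteria weights (they must sum to one).
weights = {"return_on_investment": 0.4, "strategic_match": 0.3, "organizational_risk": 0.3}

# Single-criterion values v_j(a_i) of two alternative DM projects, scaled to [0, 1].
alternatives = {
    "churn_model":  {"return_on_investment": 0.8, "strategic_match": 0.6, "organizational_risk": 0.4},
    "wallet_model": {"return_on_investment": 0.5, "strategic_match": 0.9, "organizational_risk": 0.7},
}

for name, values in alternatives.items():
    print(name, round(additive_utility(values, weights), 3))
# churn_model:  0.4*0.8 + 0.3*0.6 + 0.3*0.4 = 0.62
# wallet_model: 0.4*0.5 + 0.3*0.9 + 0.3*0.7 = 0.68
```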
Evaluating alternative DM projects (alternatives a_i) as investments can be based on different investment appraisal techniques, which attach values v_j(a_i) to the alternatives taking different criteria into account. One traditional investment appraisal technique is Parker et al.'s [28] information economics, which considers two domains: business and technology. The business domain includes the following factors: (1) Return on investment, (2) Strategic match, (3) Competitive advantage, and (4) Organizational risk. The technology domain includes the factors: (1) Strategic architecture alignment, (2) Definitional uncertainty risk, (3) Technical uncertainty, and (4) Technology infrastructure risk.
These factors have been regrouped in [35] into two main criteria: value and risk. In this new structure, the value criterion contains the business domain factors except Organizational risk, plus one technology domain factor, Strategic architecture alignment. The risk criterion includes the other four factors. They have further extracted 27 detailed criteria from the IT/IS
Figure 5. The generic framework for utility evaluation from a DM stakeholder point of view (w1, w2, w11, ..., w1n, and w21, ..., w2n are weights).
management literature and demonstrated one possible setting of these detailed criteria under
the main ones. They have further changed the multi-attribute additive function following a multi-criteria utility theory approach, in which the values of the alternatives (v_j(a_i) in the above formula) are scaled into utilities on an interval from zero to one.
The result is a four-level selection model, where the top level (level 1) gives the net utility of each project (V(a_i, w) in the formula) based on two equally weighted main criteria at level 2: the value and the risk of the project alternative. Both the value and the risk main criteria are composed of four level-3 sub-criteria (mentioned above), each with its own weight structure. The detailed criteria are at level 4, and the utility value of each level-3 sub-criterion is decided taking 2–4 of them into account, using different weight structures [35].
We enhance this simple and flexible tree structure into another structure in which different sub-criteria can affect more than one main criterion, and we allow this approach to extend down to the lower levels too. In our new generic multi-criteria utility based framework (MCUF), the weights of the two main criteria, w1 and w2 (see Figure 5), can be fixed taking into account the context of decision making and the preferences of the stakeholder (for example, whether he is risk seeking or risk averse). The weights w11, ..., w1n and w21, ..., w2n used to calculate the utility values of the main criteria from the utility values of the sub-criteria are also allowed to be fixed in a context- and stakeholder-dependent way. The weight of a criterion is allowed to be zero, as appears in the examples of the next section when some sub-criteria are dropped from the corresponding figure.
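The two-level weighting just described can be sketched as follows; the sub-criterion names, their utility scores and all weights are hypothetical, and a sub-criterion that affects only one main criterion simply receives a zero weight under the other, as in the examples of Figures 7–9.

```python
# Hypothetical utility scores of sub-criteria for one DM project alternative.
sub_utilities = {
    "return_on_investment": 0.7,
    "strategic_match": 0.9,
    "technical_uncertainty": 0.3,
    "organizational_risk": 0.5,
}

# Weights of sub-criteria under each main criterion; a zero weight means the
# sub-criterion does not affect that main criterion.
weights = {
    "value": {"return_on_investment": 0.6, "strategic_match": 0.4,
              "technical_uncertainty": 0.0, "organizational_risk": 0.0},
    "risk":  {"return_on_investment": 0.0, "strategic_match": 0.0,
              "technical_uncertainty": 0.5, "organizational_risk": 0.5},
}

# Weights of the two main criteria (w1, w2), chosen here as an illustration
# of a stakeholder who weighs risk more heavily than value.
main_weights = {"value": 0.4, "risk": 0.6}

main_utilities = {
    crit: sum(w * sub_utilities[s] for s, w in weights[crit].items())
    for crit in weights
}
net_utility = sum(main_weights[c] * main_utilities[c] for c in main_weights)

print(main_utilities)          # {'value': 0.78, 'risk': 0.4}
print(round(net_utility, 3))   # 0.4*0.78 + 0.6*0.4 = 0.552
```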
Figure 6 has its roots in CRISP-DM (recall Figure 1). However, here we emphasize the importance of the Use process, which can start once there exists a ready-to-use DM artefact (that is, the result of the development, implementation, evaluation, and deployment of a certain DM solution that addresses a recognized business challenge).
We also emphasize that the use process is connected to certain type(s) of DM stakeholders (in the middle of Figure 6). They may have different (and potentially competing) business challenges. Their use of the DM artifact leads to certain individual and/or organizational impacts (recall the success model of Figure 3), which need to be evaluated for utility considerations.
In Figures 7–9 we show how our generic framework (MCUF) accounts for the different utility considerations with regard to the different groups of DM stakeholders. In Figure 7 we demonstrate one hypothetical example of utility consideration by top management in some organization (i.e. an external customer, who can potentially adopt a DM artifact for managerial decision making) in a hypothetical situation in which she is deciding about a DM project. Her main criteria are value and risk. The traditional investment appraisal technique is information economics, considering two domains: the business and the technology domain [28]. Those typical eight criteria (sub-criteria in Figure 7) affect the main criteria (i.e. value or risk) with corresponding weights w_ij, i = 1, 2; j = 1, ..., 8. With zero weights we emphasize the fact that some sub-criteria may impact only one group of main criteria.
In Figure 8, the utility consideration of a domain expert is presented (i.e. another type of external customer, who can use a DM artifact for decision support in daily operational decision making, e.g. in diagnostics). We highlight here the sub-criteria that impact the overall utility of a DM tool from the domain expert's point of view, such as: satisfaction from use, possible changes in responsibility for the decisions made, whether the tool is transparent in its functionality and the results are easy to interpret, whether training and support are likely to be needed (and provided), and, finally, what the overall impact of the use of the tool will be. We can see that these utility considerations (in fact, potentially related to the same DM artifact) differ quite a lot from those of the previous group of stakeholders.
Figure 9 illustrates a different type of example, in which an editor (or a peer reviewer) of a DM journal (an example of an internal customer) needs to decide whether to accept
Figure 8. Example of utility consideration by a domain expert when deciding whether to (continue to) use a
DM artifact (weights omitted from figure).
a submitted paper or not. Here too, a set of sub-criteria can be conceived that impact the major value or risk criteria: how relevant the paper is to the scope of the journal, how rigorous the methods are, how relevant the results are, what the impact of the paper on the credibility of the journal and of the DM field would be, whether research ethics are respected, and whether the paper's contents are comprehensible.
5. Conclusions
Strong expectations have lately been placed on DM to help organizations and individuals get more utility from their databases and data warehouses. These expectations are based more on the fine rigorous research results achieved on the technical aspects of DM methods and algorithms than on a vast amount of practical success stories. The time when DM research has to answer the practical expectations as well is fast approaching.
Figure 9. Example of utility consideration by an editor of an DM journal deciding on the acceptance of the
submitted DM paper (weights omitted from figure).
take care about research of users’ (both individuals’ and organizations’) goals and suc-
cess factors when they install and use DM (system) parts in their ensembles of IS func-
tionalities? Can this research be left to IS researchers or do the researchers of these topics
need some amount of DM knowledge also?
The goal of this paper is to bring these broader UBDM questions into the discussion among researchers and practitioners in the DM area. We first gave a short review of the DM research that has taken utility into account. The review by no means covers all published papers, but it shows the main lines of work. We then took a closer look at the practice-oriented normative advice included in the CRISP-DM model to motivate use and user orientation more broadly in utility considerations. We briefly discussed a well-known success model and the design science approach applied in the IS discipline. We considered more closely the stakeholders of DM research and, in particular, their utilities. We suggested a sketch of a new generic multi-criteria utility-based framework (MCUF) for a more detailed analysis of different stakeholders' utilities in their contexts, and gave some illustrative examples of the use of the framework in hypothetical contexts. The framework still needs to be studied in various real application contexts in order to validate it.
In our future work we plan to focus on a meta-analysis of DM research, tracing its development, and to produce categorisations of the examined DM research based on its theory/practice orientation, the use of different kinds of research methods, and other criteria.
We plan to estimate the approximate proportions of published work in different directions and different types of DM research through a literature review and the categorisation of papers according to predefined classification criteria. Besides analysing the relevant literature from the top international data-mining-related journals and international conference proceedings, we plan to collect and analyse the editorial policies of these journals and conferences. This will result in a better understanding of the major findings and trends in the DM research area. We expect that it will also help us to highlight the existing imbalance in the area and suggest ways of improving the situation.
Acknowledgements
This research was partly supported by the Academy of Finland. We would like to thank
the reviewers for their constructive and detailed comments.
References
[1] N. J. Adler and S. Bartholomew. Academic and professional communities of discourse: Generating
knowledge on transnational human resource management. Journal of International Business Studies,
23(3):551–569, 1992.
[2] R. P. Bagozzi, Y. Yi, and L. W. Phillips. Assessing construct validity in organizational research. Admin-
istrative Science Quarterly, 36(3):421–458, 1991.
[3] S. D. Bay and M. J. Pazzani. Discovering and describing category differences: What makes a discovered
difference insightful. In Proc. of the 22nd Annual Meeting of the Cognitive Science Society, 2000.
[4] A. Bhattecherjee. Understanding and evaluating relevance in IS research. Communications of the Asso-
ciation for Information Systems, 6(6), 2001.
[5] R. Bokhari. The relationship between system usage and user satisfaction: a meta-analysis. The Journal
of Enterprise Information Management, 18(2):211–234, 2005.
[6] A. P. Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms.
Pattern Recognition, 30(7):1145–1159, 1997.
[7] D. Carvalho, A. Freitas, and N. Ebecken. Evaluating the correlation between objective rule interesting-
ness measures and real human interest. In A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama,
editors, Proc. of the 16th European Conf. on machine learning and the 9th European Conf. on principles
and practice of knowledge discovery in databases ECML/PKDD-2005, pages 453–461. Springer, 2005.
[8] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth. CRISP-DM 1.0
Step-by-step data mining guide. The CRISP-DM consortium, 2000.
[9] D. S. Coppock. Data mining and modeling: So you have a model, now what? DM Review Magazine,
2003.
[10] A. Cresswell. Thoughts on relevance of is research. Communications of the Association for Information
Systems, 6(9), 2001.
[11] W. DeLone and E. McLean. Information systems success: The quest for the dependent variable. Infor-
mation Systems Research, 3(1):60–95, 1992.
[12] W. DeLone and E. McLean. The delone and mclean model of information systems success: A ten-year
update. Journal of MIS, 19(4):9–30, 2003.
[13] K. Desouza, O. El Sawy, R. Galliers, C. Loebbecke, and R. Watson. Beyond rigor and relevance towards
responsibility and reverberation: Information systems research that really matters. Communications of
the Association for Information Systems, 16(16):341–353, 2006.
[14] P. Domingos. Metacost: a general method for making classifiers cost-sensitive. In Proc. of the 5th ACM
SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 1999.
[15] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan. Adacost: misclassification cost-sensitive boosting. In
Proc. 16th International Conf. on Machine Learning, pages 97–105. Morgan Kaufmann, 1999.
[16] U. Fayyad. Data mining and knowledge discovery: Making sense out of data. IEEE Expert, 11(5):20–25, 1996.
[17] C. Giraud-Carrier. Success Stories in Data/Text Mining. Brigham Young University, 2004.
[18] R. Greiner. Budgeted learning of probabilistic classifiers. In Proc. of the Workshop on Utility-Based Data Mining (UBDM'06), Invited Talk, 2006.
[19] K. Hermiz. Critical success factors for data mining projects. DM Review Magazine, 1999.
[20] A. R. Hevner, S. T. March, J. Park, and S. Ram. Design science in information systems research. MIS
Quarterly, 28(1):75–105, 2004.
[21] R. Hirschheim and H. Klein. Crisis in the IS field? A critical reflection on the state of the discipline.
Journal of the Association for Information Systems, 4(10):237–293, 2003.
[22] R. C. Holte and C. Drummond. Cost-sensitive classifier evaluation. In Proc. of the 1st Int. Workshop on
Utility-Based Data Mining, UBDM ’05, pages 3–9. ACM Press, 2005.
[23] M. Kamal. IT innovation adoption in the government sector: Identifying the critical success factors. The Journal of Enterprise Information Management, 19(2):192–222, 2006.
[24] R. Keeney and H. Raiffa. Decisions with Multiple Objectives: Preferences and Value Tradeoffs. Wiley:
New York, 1976.
[25] D. Luo, L. Cao, C. Luo, C. Zhang, and W. Wang. Towards business interestingness in actionable knowl-
edge discovery. In C. Soares, Y. Peng, J. Meng, T. Washio, and Z.-H. Zhou, editors, Applications of
Data Mining in E-Business and Finance, pages 99–109. IOS Press, 2008.
[26] G. Melli, O. R. Zaïane, and B. Kitts. Introduction to the special issue on successful real-world data
mining applications. SIGKDD Explorations, 8(1):1–2, 2006.
[27] M. Ohsaki, H. Abe, S. Tsumoto, H. Yokoi, and T. Yamaguchi. Evaluation of rule interestingness mea-
sures in medical knowledge discovery in databases. Artificial Intelligence in Medicine, 41(3):177–196,
2007.
[28] M. Parker, R. Benson, and H. Trainor. Information Economics: Linking Business Performance to Information Technology. Prentice-Hall, Englewood Cliffs, NJ, 1988.
[29] M. Pazzani. Knowledge discovery from data? IEEE Intelligent Systems, 15(2):10–13, 2000.
[30] M. Pechenizkiy, S. Puuronen, and A. Tsymbal. Why data mining does not contribute to business? In
C. S. et al, editor, Proc. of Data Mining for Business Workshop, DMBiz (ECML/PKDD’05), pages 67–
71, 2005.
[31] F. Provost and T. Fawcett. Analysis and visualization of classifier performance: Comparison under
imprecise class and cost distributions. In Proc. of the 3rd Int. Conf. on Knowledge Discovery and Data
Mining, 1997.
[32] S. Puuronen, M. Pechenizkiy, and A. Tsymbal. Keynote paper: Data mining researcher, who is your cus-
tomer? some issues inspired by the information systems field. In Proc. of the 17th Int. Conf. on Database
and Expert Systems Applications DEXA’06, pages 579–583. IEEE Computer Society, Washington, DC,
2006.
[33] A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE
Transactions on Knowledge and Data Engineering, 8(6):970–974, 1996.
[34] P. Soja. Success factors in ERP systems implementations: Lessons from practice. The Journal of Enterprise Information Management, 19(6):646–661, 2006.
[35] R. Steward and S. Mohammed. IT/IS projects selection using multi-criteria utility theory. Logistics Information Management, 15(4):254–270, 2002.
[36] D. Straub. The value of scientometric studies: An introduction to a debate on IS as a reference discipline.
Journal of AIS, 7(5):241–246, 2006.
[37] L. Van Dyne, L. Cummings, and J. McLean Parks. Extra role behaviours: In pursuit of construct and
definitional clarity (a bridge over muffled waters). Research in Organizational Behavior, 17:215–285,
1995.
[38] S. Viaene and G. Dedene. Cost-sensitive learning and decision making revisited. European Journal of
Operational Research, 166:212–220, 2004.
[39] G. Weiss, M. Saar-Tsechansky, and B. Zadrozny. UBDM ’05: Proc. of the 1st Int. workshop on Utility-
based data mining. 2005.
[40] X. Wu, P. S. Yu, G. Piatetsky-Shapiro, N. Cercone, T. Y. Lin, R. Kotagiri, and B. W. Wah. Data mining:
How research meets practical development. Knowledge and Information Systems, 5(2):248–261, 2000.
[41] H. Yao and H. J. Hamilton. Mining itemset utilities from transaction databases. Data and Knowledge
Engineering, 59(3):603–626, 2006.
[42] B. Zadrozny, G. Weiss, and M. Saar-Tsechansky. Proc. of the 2nd Int. Workshop on Utility-based data
mining, UBDM ’06. 2006.
Data Mining for Business Applications
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-66
Customer Validation of Commercial Predictive Models
T. Bruckhaus and W.E. Guthrie
1 Corresponding author: Numetrics Management Systems, Inc., 20863 Stevens Creek Blvd., Suite 510, Cupertino, CA 95014; E-mail: [email protected]
Introduction
1996. Since then, we have accumulated a rich history of experiences with creating
predictive products, as well as with selling and supporting them. Customers must be
confident that our applications give accurate results to rely on them for business-critical
decisions. Validating the predictions is therefore an essential step to acceptance.
Customer validation is similar to the traditional mathematical validation of data
mining algorithms and predictive models. However, in many ways, customer
validation comprises a superset of the difficulties and challenges of mathematical
validation. In our experience with applying data mining technology to real industry
data and actual business problems, data mining currently focuses predominantly on a
small fraction of the entire problem. Kohavi & Provost [1] capture our own assessment
of the situation well when they state:
“It should also be kept in mind that there is more to data mining than just building an
automated [...] system. […]. With the exception of the data-mining algorithm, in the current
state of the practice the rest of the knowledge discovery process is manual. Indeed, the
algorithmic phase is such a small part of the process because decades of research have focused
on automating it – on creating effective, efficient data mining algorithms. However, when it
comes to improving the efficiency of the knowledge discovery process as a whole, additional
research on efficient data mining algorithms will have diminishing returns if the rest of the
process remains difficult and manual. […] In sum, […] there still is much research needed –
mostly in areas of the knowledge discovery process other than the algorithmic phase.”
In this paper, we explore the specific research needs that relate to customer validation of predictive models.
1. Background
Data mining experts and customers of data mining technology do not necessarily
share the same training and background. Data miners typically have thorough
knowledge of data mining as well as statistical training. For example, some of the
more widely read overview texts on data mining are Berry & Linoff [2], Han &
Kamber [3], Mitchell [4], Quinlan [5], Soukup & Davidson [6], and Witten & Frank [7].
The customer view of model validation is different from the academic view
because one can make few assumptions as to the statistical and data mining savvy of
the customer. Customers must focus on their business needs, and in our case, these
needs are those of the semiconductor business. Our customers view our product not so
much as a predictive model, but rather as a tool that can answer questions about cost,
time-to-market, and productivity of a semiconductor design project. Our customers do
not concern themselves primarily with the predictive model that is the engine inside the
application. Instead, they focus on the controls of the application, and in order to
receive value from the product, they need the application to address their business
needs directly and immediately.
When our customers ask, “how accurate is your model?” they have a broad and
diverse mental picture of what “accuracy” means. Many of our customers are
engineers, so they expect a meticulous response. The following table lists some of the
questions our customers are likely to ask when they validate a predictive model:
1. Our organization has collected effort metrics on 50 completed projects. How accurately does the application rank the expected effort for those projects?
2. I have experience with version N of the model. With version N+1 available, should I migrate to version N+1? Is it better? How much better is it?
3. How accurate is the application, and how is accuracy measured?
7. I know from experience that the number of capacitors on an analog design relates to effort. How can the application predict effort accurately when we cannot enter the number of capacitors into the application?
8. I do not track Ring Oscillator Delay, but the application requires this input. Will the application still be useful without this input, and how sensitive is the application to inaccurately entered data?
9. The application asks for clock speed but my design is pure analog and has no clocks. What should I enter? Is this model useful to me?
It is apparent that our customers’ questions are mostly specific to the domain of
semiconductor design. Even when our customers do not ask their questions explicitly
in the terminology of the semiconductor business, they would still like to obtain
answers in semiconductor design-project terminology. For example, when our
customers ask: “How accurate is the application, and how is accuracy measured?” they
would prefer an answer that uses their terminology, like “X% of projects complete
within Y% of the predicted completion date” to an answer which does not use their
terminology, like “The F-Score is X.” For comparison, the next table lists questions
that focus on data mining technology:
10. What is the area under the Receiver-Operating-Characteristic Curve?
11. What is the optimal number of boosting operations?
12. What is the Lift for this model?
13. What is the F-Score?
14. What is the Cross Entropy of the model?
15. How well would the application perform on the Iris Data Set [25]?
16. How imbalanced was the training data set?
17. Where is the precision/recall break-even point?
18. Does the application use a Support Vector Machine?
These two sets of questions illustrate that customers think in terms of their field of
application rather than in terms of data mining, and more importantly, it is often not
clear how to translate one language into the other.
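As a small illustration of such a translation, the sketch below (hypothetical numbers and variable names) computes the domain-language statement "X% of projects complete within Y% of the predicted effort" from predicted and actual values:

```python
# Hypothetical example: expressing accuracy in domain terms ("X% of projects
# complete within Y% of the predicted effort") rather than as an abstract score.
predicted_effort = [120, 340, 80, 500, 210]   # predicted person-weeks per project
actual_effort    = [130, 310, 95, 520, 260]   # actual person-weeks observed

tolerance = 0.15                              # Y = 15%
within = [abs(p - a) / a <= tolerance
          for p, a in zip(predicted_effort, actual_effort)]
share = 100.0 * sum(within) / len(within)
print(f"{share:.0f}% of projects completed within {tolerance:.0%} of the predicted effort")
```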
In addition to a desire for familiar terminology, there are other peculiarities about
customer validation. Some customers are more interested in “white box” validation
whereas others may be more interested in “black box” validation, where white box
validation considers how the model operates, and black box validation only considers
the behavior displayed by the application. Both types of customers would like to
receive answers to their questions. A further complication however, is that customers
who are interested in white box validation may want to understand the model in terms
of engineering equations they are familiar with, and they might want to see a formula
which describes how a specific input of interest affects the predictive output of the
model.
A related question arises in the context of planning: “how accurate is the
application’s estimate of effort for a specific new project that will be critical to the
customer’s future success?” This question cannot be answered based on a single project
(one observation), because during the project’s planning stage it is obviously
impossible to compare its predicted effort to its actual effort. Moreover, we must
measure model accuracy for a population of cases and although the model may be very
accurate across an entire population, it may provide less accurate results for a single
case. Generally, it is not clear whether and how it might be possible to obtain accuracy
estimates for specific cases. Such individual cases, or use cases, may be pivotal to the
3. Analysis
The issue of customer-oriented evaluation has two components. The first is how to
place evaluations in the customer's language, which we will refer to as the “domain
language requirement”. The second component is perhaps more relevant to data
mining research: how to make sure that evaluations actually answer the wide range of
questions that customers will ask. We will refer to this requirement as the “evaluation
completeness requirement”.
One avenue to address the domain language requirement might be to use a generic language that can be mapped more or less easily into the language of a particular domain. How, then, do the different questions in Table 1 correspond to more general questions in such a generic language? For example, Question 1 is a question of applying the model to particular cases that the customer has in mind. Question 5 talks about the accuracy of the model on a particular subspace of the domain. Question 8 talks about how to deal with missing values during model use, as opposed to model construction. Question 9 talks about the inputs actually "used" by the model.
Table 3 provides such a mapping from customer language to machine learning
language for all sample questions listed in Table 1. In the left-hand column of Table 3,
we list the customers' domain-specific concepts related to the questions from Table 1. We
then match the fragments to approximately comparable ideas and approaches in
machine-learning language in the middle column. We also suggest some key machine-
learning concepts in the right-hand column, which appear to relate to the customer
concern in question.
Table 3. Mapping of customer concerns to machine-learning language and concepts.

• Customer concern: Estimations based on first-hand experience and intuition (Table 1, items 1, 4, 7, 9). Machine-learning view: Reference to potential use cases or background expert knowledge. To achieve customer acceptance and win their business, it may be particularly important that the model perform well on these use cases. It may be possible to improve model performance by incorporating appropriate background expert knowledge or by capturing additional training cases. Key concepts: Use Case, Background Expert Knowledge, Training Cases, Validation Cases.

• Customer concern: Knowledge of relative actual outcomes (item 1). Machine-learning view: Statistical analysis of ranking, such as Spearman's rank correlation coefficient, may be a good tool for evaluating model performance (see the sketch after this table). Key concepts: Rank Correlation.

• Customer concern: Concern over risk associated with improvement vs. stability (item 2). Machine-learning view: Compare specific alternative models in terms of their performance and quality. Key concepts: Model Comparison.

• Customer concern: Business risk due to potentially inaccurate estimations (items 3, 5, 6). Machine-learning view: References to model quality; some are more generic, while others are more specific and identify project duration as a target variable. Key concepts: Model Quality, Target Variables.

• Customer concern: Intuitions about expected model behavior in response to changes in input values (items 4, 7, 8). Machine-learning view: Consider use cases where a specific input changes, such as "frequency", and review the impact in terms of the sensitivity of the model. Key concepts: Model Sensitivity, Specific Inputs.

• Customer concern: Awareness that unique cases within the domain require special treatment (items 5, 6, 7). Machine-learning view: There are reportedly clusters of cases in the input space where the model should perform differently from how it performs in other ranges. It may be helpful to use unsupervised learning to discover such clusters and to offer cluster membership as an input. Alternatively, one may build different models or sub-models to address different sub-domains. Key concepts: Case Clustering, Sub-Models, Stratification.

• Customer concern: Insights about which parameters should be used for estimation. Machine-learning view: A variable considered important by the customer is not an input to the model. Are there one or more "proxy" variables in the model, which account for some of the … Key concepts: Missing Variables, Proxy Variables, Adding Inputs.

• Customer concern: Required data cannot be collected or does not apply (items 8, 9). Machine-learning view: Estimation based on partial inputs; dealing with missing values; inapplicable inputs. Key concepts: Missing Data in Scoring Records.
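As an example of the rank-correlation idea referenced in the table above, the following sketch (hypothetical effort figures; SciPy assumed available) evaluates how well the application ranks project effort:

```python
# Hypothetical check of how well the application ranks expected project effort,
# using Spearman's rank correlation between predicted and actual effort.
from scipy.stats import spearmanr

predicted_effort = [120, 340, 80, 500, 210, 150]
actual_effort    = [130, 310, 95, 520, 260, 140]

rho, p_value = spearmanr(predicted_effort, actual_effort)
print(f"Spearman rank correlation: {rho:.2f} (p = {p_value:.3f})")
```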
Certainly, our list of customer needs and questions and our mapping into the
machine-learning domain is not exhaustive. For example, some topics which we have
not addressed but which are equally important to customer validation of data mining
technology are the explanatory power of data mining models [26], the financial impact
of predictive models [27], and information retrieval-related customer needs [28].
Understanding customer concerns is a prerequisite for validating practical,
commercial data mining applications. In some cases, it may be best to address customer
validation needs by analyzing the output of a model, while in other cases it may be
possible to address customer validation needs directly inside of the data-mining
algorithm. For example, it may be possible to build predictive models while
specifically taking into account the sensitivity of the model to variations of the inputs,
as suggested by [29], and [30]. As we understand customer validation needs better,
researchers and practitioners will be able to address the evaluation completeness
requirement better. One method may be to select algorithms and model evaluation
procedures that address specific customer validation needs by design.
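One simple way to report the kind of input sensitivity customers ask about is a one-at-a-time perturbation of a single input with all other inputs held fixed. The sketch below is only an illustration under our own assumptions (a generic scikit-learn-style `predict` interface and a hypothetical feature index), not a description of the product:

```python
import numpy as np

def sensitivity(model, case, feature_index, deltas):
    """One-at-a-time sensitivity: perturb a single input of one case by the given
    relative amounts and report how the prediction changes (other inputs fixed)."""
    base = float(model.predict(case.reshape(1, -1))[0])
    changes = []
    for d in deltas:
        perturbed = case.copy()
        perturbed[feature_index] *= (1.0 + d)
        changes.append((d, float(model.predict(perturbed.reshape(1, -1))[0]) - base))
    return base, changes

# Hypothetical usage: how does the predicted effort react to +/-10% in input 3?
# base, changes = sensitivity(trained_model, np.asarray(one_project),
#                             feature_index=3, deltas=[-0.10, -0.05, 0.05, 0.10])
```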
4. Conclusions
In this paper, we have reviewed customer requirements for the evaluation of data
mining models. Common themes in customer model validation include:
• Sensitivity: the model's response to changes in input values, sign and
magnitude
• Range: the specific range where the model's inputs are valid
• Parameters: which of the thousands of possible factors does the model
incorporate directly, and which are covered by proxies, and
It is generally not practical to train customers in data mining validation, and what
we need instead is technology for supporting customer validation in practical terms. It
appears that customers are interested in model-level accuracy, the effect of specific
inputs on model output, as well as in a great variety of domain-specific use cases. The
customer view of model validation is at once very similar to and very different from the
data miner’s view, and it is our hope that technologies will evolve that will make it
easy to cross the chasm between the two.
References
[1] Kohavi, R., and Provost, F., January 2001, “Applications of Data Mining to E-commerce” (editorial),
Applications of Data Mining to Electronic Commerce. Special issue of the International Journal Data
Mining and Knowledge Discovery.
[2] Berry, M.J.A., and Linoff, G. 1997. Data Mining Techniques: For Marketing, Sales, and Customer
Support. John Wiley & Sons.
[3] Han, J and Kamber, M. 2005. Data Mining, Second Edition: Concepts and Techniques (The Morgan
Kaufmann Series in Data Management Systems)
[4] Mitchell, T. 1997. Machine Learning. McGraw-Hill Science / Engineering / Math; first edition.
[5] Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
[6] Soukup, T. and Davidson, I. 2002. Visual Data Mining: Techniques and Tools for Data Visualization
and Mining. Wiley.
[7] Witten, I.H., and Frank, E. 2005. Data Mining: Practical machine learning tools and techniques.
Morgan Kaufmann, San Francisco. Second Edition.
[8] Caruana, R., and Niculescu-Mizil, A. 2004, Data Mining in Metric Space: An Empirical Analysis of
Supervised Learning Performance Criteria. In Proceedings of the Tenth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004.
[9] Caruana, R., and Niculescu-Mizil, A., 2006, An Empirical Comparison of Supervised Learning
Algorithms. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA.
[10] Bruckhaus, T., Ling, C.X., Madhavji, N.H., and Sheng, S. 2004. Software Escalation Prediction with
Data Mining. Workshop on Predictive Software Models (PSM 2004), A STEP Software Technology &
Engineering Practice.
[11] Chawla, N.V., Japkowicz, N., and Kolcz, A. eds. 2004. Special Issue on Learning from Imbalanced
Datasets. SIGKDD, 6(1): ACM Press.
[12] Domingos, P. 1999. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings
of the Fifth International Conference on Knowledge Discovery and Data Mining, 155-164, ACM Press.
[13] Drummond, C., and Holte, R.C. 2003. C4.5, Class Imbalance, and Cost Sensitivity: Why under-
[23] Weiss, G., and Provost, F. 2003. Learning when Training Data are Costly: The Effect of Class
Distribution on Tree Induction. Journal of Artificial Intelligence Research 19: 315-354.
[24] Zadrozny, B., Langford, J., and Abe, N. 2003. Cost-sensitive Learning by Cost-Proportionate Example
Weighting. In Proceedings of International Conference of Data Mining (ICDM).
[25] Anderson, E. 1935, "The irises of the Gaspé peninsula", Bulletin of the American Iris Society 59, 2-5.
[26] Pazzani, M. J. (2000), "Knowledge Discovery from Data?" IEEE Intelligent Systems, March/April
2000, 10-13.
[27] Bruckhaus, T. 2007. The Business Impact of Predictive Analytics. Book chapter in Knowledge
Discovery and Data Mining: Challenges and Realities with Real World Data. Zhu, Q, and Davidson, I.,
editors. Idea Group Publishing, Hershey, PA
[28] Joachims, T., 2002. Optimizing search engines using clickthrough data. In Proceedings of the eighth
ACM SIGKDD international conference on Knowledge discovery and data mining, Edmonton, Alberta,
Canada, pp 133 - 142
[29] Engelbrecht, A. P., 2001, “Sensitivity Analysis for Selective Learning by Feedforward Neural
Networks", Fundamenta Informaticae, 45(4), pp 295-328.
[30] Castillo, E., Guijarro-Berdiñas, B., Fontenla-Romero, O., Alonso-Betanzos, A., 2006, A Very Fast
Learning Method for Neural Networks Based on Sensitivity Analysis, Journal of Machine Learning
Research, 7(Jul), pp 1159-1182.
Part 2
Data Mining Applications of Today
Data Mining for Business Applications
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-77
Customer Churn Prediction – A Case Study in Retail Banking
T. Mutanen et al.
Abstract. This work focuses on one of the central topics in customer relationship
management (CRM): transfer of valuable customers to a competitor. Customer re-
tention rate has a strong impact on customer lifetime value, and understanding the
true value of a possible customer churn will help the company in its customer rela-
tionship management. Customer value analysis along with customer churn predic-
tions will help marketing programs target more specific groups of customers. We
predict customer churn with logistic regression techniques and analyze the churn-
ing and nonchurning customers by using data from a consumer retail banking com-
pany. The results of the case study show that using conventional statistical methods
to identify possible churners can be successful.
Introduction
This paper presents a customer churn analysis in the consumer retail banking sector. The focus of customer churn analysis is to determine the customers who are at risk of leaving and, if possible, to analyze whether those customers are worth retaining. A company will therefore have a sense of how much is really being lost because of customer churn and of the scale of effort that would be appropriate for a retention campaign.
Customer churn is closely related to the customer retention rate and loyalty. Hwang et al. [8] call customer defection the hottest issue in the highly competitive wireless telecom industry. Their LTV model suggests that the churn rate of a customer has a strong impact on the LTV because it affects the length of service and the future revenue. Hwang et al. also define customer loyalty as an index of the extent to which customers would like to stay with the company. Churn describes the number or percentage of regular customers who abandon their relationship with the service provider [8].
Modeling customer churn from a purely parametric perspective is not appropriate in the LTV context, because the retention function tends to be spiky and non-smooth, with spikes at the contract ending dates [14]. Usually, from the marketing perspective, the probability of a possible churn is sufficient information about churn. This enables the marketing department, given its limited resources, to contact the high-probability churners first [1].
1 Corresponding Author, e-mail: teemu.mutanen@vtt.fi
Lester explains the segmentation approach in customer churn analysis [11]. She also points out the importance of studying the right characteristics in customer churn analysis; in the banking context, for example, the signals studied might include a decreasing account balance or a decreasing number of credit card purchases. A similar type of descriptive analysis has been conducted by Keaveney et al. [9], who studied customer switching behavior in online services based on questionnaires sent out to the customers. Garland has done research on customer profitability in personal retail banking [7]. Although his main focus is on the customers' value to the study bank, he also investigates the duration and age of the customer relationship in relation to profitability. His study is based on a customer survey by mail, which helped him to determine the customers' share of wallet, satisfaction and loyalty from qualitative factors.
Table 1 presents examples of the churn prediction studies found in the literature: analyses of churning customers have been conducted in various fields. However, to the best of our understanding, no practical studies focused on the difference between continuers and churners have been published for the retail banking sector.
1. Case study
The consumer retail banking sector is characterized by customers who stay with a company for a very long time. Customers usually give their financial business to one company and do not switch the provider of their financial services very often. From the company's perspective this produces a stable environment for customer relationship management. Despite these continuous customer relationships, the potential loss of revenue caused by customer churn can be huge. The mass marketing approach cannot succeed in the diversity of consumer business today. Customer value analysis along with customer churn predictions will help marketing programs target more specific groups of customers.
In this study a customer database from a Finnish bank was used and analyzed. The data consisted only of personal customers and were collected from the period December 2002 till September 2005. The sampling interval was three months, so for this study we had relevant data for 12 points in time [t(0)–t(11)]. In the logistic regression analysis we used a sample of 151 000 customers.
Figure 1. Customers with and without a current account and their average in/out money flows in different channels. The legend shows the number of customers that have transactions in each channel (a test sample of n=50 000 was used).
In total, 75 variables were collected from the customer database. These variables are related to the following topics: (1) account transactions IN, (2) account transactions OUT, (3) service indicators, (4) personal profile information, and (5) customer-level combined information. Transactions have volume variables for both the in and out channels; the out channels also have frequency variables for the different channels.
The data had 30 service indicators in total (e.g. a 0/1 indicator for whether the customer has a housing loan or not). One of these indicators, C1, described the current account. Figure 1 shows the average money volumes in the different channels for two groups of customers in sample 1, where the customers are divided based on the current account indicator.
As mentioned previously, customers' value to a company is at the heart of all customer management strategy. In the retail banking sector, revenue is generated both from the margins of lending and investment activities and from service, transaction, credit card and other fees. As Garland noted [7], retail banking is characterized by many customers (compared to wholesale banking with its few customers), many of whom make relatively small transactions. This setup makes it hard to define customer churn in the retail banking sector based on customer profitability.
One of the indicators mentioned above, C1, tells whether the customer has a current account in the time period at hand or not, and the definition of churn in the case study is based on it. This simple definition is adequate for the study and makes it easy to detect
the exact moment of churn. The customers without a C1 indicator before the time period were not included in the analysis; their volume in the dataset is small. In the banking sector, a customer who does leave may leave an active customer id behind, because bank record formats are dictated by legislative requirements.
The definition of churn presented above produced a relatively small number of customers to be considered churners: on average, less than 0.5% of the customers in each time step were considered churners.
This problem has been identified in the literature under the term class imbalance problem [10]; it occurs when one class is represented by a large number of examples while the other is represented by only a few. The problem is particularly crucial in an application, such as the present one, where the goal is to maximize recognition of the minority class [4]. In this study a down-sizing method was used to avoid all predictions turning out as nonchurners. The down-sizing (under-sampling) method consists of randomly removing samples from the majority class population until the minority class becomes some specific percentage of the majority class [3]. We used this procedure to produce two different datasets for each time step: one with a churner/nonchurner ratio of 1/1 and the other with a ratio of 2/3.
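A minimal sketch of such a down-sizing step is shown below, assuming the data sit in a pandas DataFrame with a 0/1 churn column (the column names are hypothetical):

```python
import pandas as pd

def downsize(df, target="churn", ratio=1.0, seed=1):
    """Randomly under-sample the majority class (nonchurners) so that
    churners make up the requested churner/nonchurner ratio."""
    churners = df[df[target] == 1]
    nonchurners = df[df[target] == 0]
    n_keep = int(len(churners) / ratio)          # ratio = churners / nonchurners
    kept = nonchurners.sample(n=min(n_keep, len(nonchurners)), random_state=seed)
    return pd.concat([churners, kept]).sample(frac=1, random_state=seed)  # shuffle

# balanced_1to1 = downsize(train_df, ratio=1.0)   # churner/nonchurner ratio 1/1
# balanced_2to3 = downsize(train_df, ratio=2/3)   # churner/nonchurner ratio 2/3
```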
In this study we use binary predictions: churn and no churn. A logistic regression method [5] was used to formulate the predictions. The logistic regression model generates a value between the bounds 0 and 1 based on the estimated model. The predictive performance of the models was evaluated using lift curves and by counting the number of correct predictions.
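A minimal sketch of this modelling step follows; scikit-learn and the tiny synthetic data are our own assumptions, used only to illustrate the probability output and the thresholding:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in for the customer variables (e.g. age, bank age,
# transaction counts); 1 = churner, 0 = nonchurner, as in a down-sized training set.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# The fitted model yields a churn probability between 0 and 1 for each customer;
# a threshold (0.5 in the tables of Section 2) turns it into a binary prediction.
churn_prob = model.predict_proba(X)[:, 1]
churn_pred = (churn_prob >= 0.5).astype(int)
print(churn_prob[:5].round(2), churn_pred[:5])
```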
2. Results
A collection of six different regression models was estimated and validated. The models were estimated using six different training sets: three time periods (t = 4, 6, and 8) with two datasets each. This produced six regression models, which were validated using data sample 3 (115 000 customers with the current account indicator). In the models we used several independent variables; the variables for each model are presented in Table 2. The number of correct predictions for each model is presented in Table 3. In the validation we used the same sample with the churners before time period t=9 removed, and the data for validation were collected from time periods t(9)–t(11).
Although all the variables in each of the models presented in Table 2 were significant, there could still be correlation between the variables. For example, in this study the variables Num. of transactions (ATM) in consecutive periods are correlated to some degree because they represent the same variable, only from different time periods. The problem that arises when two or more variables are correlated with each other is known as multicollinearity. Multicollinearity does not change the estimates of the coefficients, only their reliability, so the interpretation of the coefficients becomes quite difficult [13]. One indicator of multicollinearity is high standard error values with low significance statistics. A number of formal tests for multicollinearity have been proposed over the years, but none has found widespread acceptance [13].
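An informal way to screen for the correlations discussed above is to inspect the pairwise correlation matrix of the candidate predictors; the sketch below uses hypothetical transaction-count columns:

```python
import pandas as pd

# Hypothetical predictor table; consecutive-period transaction counts are exactly
# the kind of variables noted above as likely to be correlated.
X = pd.DataFrame({
    "atm_t":   [12, 5, 9, 20, 3, 7],
    "atm_t_1": [11, 6, 10, 18, 4, 8],
    "card_t":  [30, 2, 15, 40, 1, 9],
})

corr = X.corr()          # pairwise Pearson correlations between predictors
print(corr.round(2))     # large off-diagonal values (e.g. |r| > 0.8) flag collinearity
```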
Table 2. Predictive variables used in each of the logistic regression models. The notation X1 marks a training dataset with a churner/nonchurner ratio of 1/1 and X2 a dataset with a ratio of 2/3. The coefficients of the variables in each of the models are presented in the table.
Model 41 42 61 62 81 82
Constant - - 0.663 - 0.417 -
Customer age 0.023 0.012 0.008 0.015 0.015 0.013
Customer bank age -0.018 -0.013 -0.017 -0.014 -0.013 -0.014
Vol. of (phone) payments in t=i-1 - - - - 0.000 0.000
Num. of transactions (ATM) in t=i-1 0.037 0.054 - - 0.053 0.062
Num. of transactions (ATM) in t=i -0.059 -0.071 - - -0.069 -0.085
Num. of transactions (card payments) t=i-1 0.011 0.013 - 0.016 0.020 0.021
Num. of transactions (card payments) t=i -0.014 -0.017 - -0.017 -0.027 -0.026
Num. of transactions (direct debit) t=i-1 0.296 0.243 0.439 0.395 - -
Num. of transactions (direct debit) t=i -0.408 -0.335 -0.352 -0.409 - -
Num. services, (not current account) -1.178 -1.197 -1.323 -1.297 -0.393 -0.391
Salary on logarithmic scale in t=i 0.075 0.054 - - - -
Table 3. Number and % share of the correct predictions (mean from the time periods t=9, 10, 11). In the
validation sample there were 111 861 cases. The results were produced by the models when the threshold
value 0.5 was used.
Model Number of correct % correct % churners in % true churners
predictions predictions the predicted set identified as churners
model 41 69670 62 0.8 75.6
model 42 81361 72 0.9 60.5
model 61 66346 59 0.8 79.5
model 62 72654 65 0.8 73.4
model 81 15384 14 0.5 97.5
model 82 81701 73 0.9 61.3
It can be seen from Table 2 that all the variables show very little variation between the models. The only larger difference is in the value of the variable Number of services in models 81 and 82 compared to its value in the rest of the models. The overall behavior of the coefficients is that the coefficient half a year before the churn has a positive sign and the coefficient three months before the churn has a negative sign. This indicates that the churning customers are those that have a declining trend in their transaction numbers. Also, a greater customer age and a smaller customer bank age both have positive impacts on the churn probability, based on the coefficient values.
The logistic regression model generates a value between the bounds 0 and 1, as presented in Section 3.1, based on the estimated model. By using a threshold value to discriminate between the customers, both types of classification error will be made: a churning customer could be classified as a nonchurner and a nonchurning customer could be classified as a potential churner. In Table 3 the number of correct predictions is presented for each model. In the validation, sample 3 was used with the churners before time period t=9 removed.
The values in Table 3 are calculated using a threshold value of 0.5. If the threshold value were instead set to 1, the percentage of correct predictions would be 99.5, because all the predictions would be nonchurners and there were only 481 churners (0.45%) on average in the validation set. The important result in Table 3 is the column "% churners in the predicted set", which gives the percentage of true churners in the predicted set when the threshold value 0.5 is used.
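The effect of the threshold can be made explicit by computing, for a chosen value, both the overall share of correct predictions and the share of true churners caught; the sketch below uses hypothetical arrays:

```python
import numpy as np

def threshold_report(churn_prob, y_true, threshold=0.5):
    """Overall accuracy and churner recall for a given probability threshold."""
    pred = (np.asarray(churn_prob) >= threshold).astype(int)
    y = np.asarray(y_true)
    accuracy = (pred == y).mean()
    churner_recall = pred[y == 1].mean() if (y == 1).any() else float("nan")
    return accuracy, churner_recall

# With a very rare churn class, raising the threshold towards 1 pushes overall
# accuracy up while catching no churners, which is why accuracy alone misleads.
probs = [0.9, 0.2, 0.7, 0.1, 0.4, 0.05]
truth = [1, 0, 1, 0, 0, 0]
print(threshold_report(probs, truth, threshold=0.5))
print(threshold_report(probs, truth, threshold=1.0))
```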
[Figure 2: lift curves plotting the % of churners identified (vertical axis) against the % of customers identified (horizontal axis) for models 41, 42, 61, 62, 81 and 82.]
Figure 2. Lift curves from the validation-set (t=9) performance of six logistic regression models. Model num-
ber (4, 6, and 8) represents the time period of the training set and (1 and 2) represent the down-sizing ratio.
The important result found in Table 3 is the proportional share of true churners identified as churners by the model. It can also be seen in the table that the models with a good overall prediction performance do not perform as well in the prediction of churners. The previously discussed class imbalance problem has an impact here.
The lift curve helps to analyze the number of true churners that are discriminated in each subset of customers. In Figure 2 the % of identified churners is presented for each of the logistic regression models. The lift curves were calculated from the validation-set performance. In Table 3 the models 41, 61, and 62 have correct predictions close to 60%, whereas models 42 and 82 have above 70% correct predictions. This difference between these five models vanishes when the number of correct predictions is analyzed in the subsets, as presented in Figure 2.
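A lift curve of the kind shown in Figure 2 can be computed by sorting customers by predicted churn probability and accumulating the share of true churners found; the following is a minimal sketch with made-up numbers, not the authors' code:

```python
import numpy as np

def lift_curve(churn_prob, y_true):
    """% of true churners identified vs. % of customers contacted, when customers
    are contacted in decreasing order of predicted churn probability."""
    order = np.argsort(np.asarray(churn_prob))[::-1]
    y_sorted = np.asarray(y_true)[order]
    pct_customers = 100.0 * np.arange(1, len(y_sorted) + 1) / len(y_sorted)
    pct_churners = 100.0 * np.cumsum(y_sorted) / y_sorted.sum()
    return pct_customers, pct_churners

# Made-up example: 5 customers, 2 of whom actually churned.
x, y = lift_curve([0.9, 0.8, 0.3, 0.2, 0.7], [1, 0, 0, 0, 1])
print(list(zip(x, y)))
```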
3. Conclusions
In this paper a customer churn analysis in the consumer retail banking sector was presented. The different churn prediction models predicted the actual churners relatively well. The findings of this study indicate that, in the case of a logistic regression model, the user should update the model over time in order to produce predictions with high accuracy, since the independent variables of the models vary. The customer profiles of the predicted churners were not included in the study.
References
[1] Au W., Chan C.C., Yao X.: A Novel evolutionary data mining algorithm with applications to churn
prediction. IEEE Trans. on evolutionary comp. 7 (2003) 532–545
[2] Buckinx W., Van den Poel D.: Customer base analysis: partial detection of behaviorally loyal clients in a
non-contractual FMCG retail setting. European Journal of Operational Research 164 (2005) 252–268
[3] Chawla N., Boyer K., Hall L., Kegelmeyer P.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002) 321–357
[4] Cohen G., Hilario M., Sax H., Hugonnet S., Geissbuhler A.: Learning from imbalanced data in surveil-
lance of nosocomial infection. Artificial Intelligence in Medicine 37 (2006) 7–18
[5] Cramer J.S.: The Logit Model: An Introduction. Edward Arnold (1991). ISBN 0-304-54111-3
[6] Ferreira J., Vellasco M., Pachecco M., Barbosa C.: Data mining techniques on the evaluation of wireless
churn. ESANN2004 proceedings - European Symposium on Artificial Neural Networks Bruges (2004)
483–488
[7] Garland R.: Investigating indicators of customer profitability in personal retail banking. Proc. of the Third
Annual Hawaii Int. Conf. on Business (2003) 18–21
[8] Hwang H., Jung T., Suh E.: An LTV model and customer segmentation based on customer value: a case
study on the wireless telecommunication industry. Expert Systems with Applications 26 (2004) 181–188
[9] Keaveney S., Parthasarathy M.: Customer Switching Behaviour in Online Services: An Exploratory
Study of the Role of Selected Attitudinal, Behavioral, and Demographic Factors. Journal of the Academy
of Marketing Science 29 (2001) 374–390
[10] Japkowicz N., Stephen S.: The class imbalance problem: A systematic study. Intelligent Data Analysis 6
(2002) 429–449
[11] Lester L.: Read the Signals. Target Marketing 28 (2005) 45–47
[12] Mozer M. C., Wolniewicz R., Grimes D.B., Johnson E., Kaushansky H.: Predicting Subscriber Dissat-
isfaction and Improving Retention in the Wireless Telecommunication Industry. IEEE Transactions on
Neural Networks, (2000)
[13] Pindyck R., Rubinfeld D.: Econometric models and econometric forecasts. Irwin/McGraw-Hill (1998).
ISBN 0-07-118831-2.
[14] Rosset S., Neumann E., Eick U., Vatnik N., Idan Y.: Customer lifetime value modeling and its use for cus-
tomer retention planning. Proceedings of the eighth ACM SIGKDD international conference on Knowl-
edge discovery and data mining. Edmonton, Canada (2002) 332-340
Data Mining for Business Applications
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-84
Resource-Bounded Outlier Detection Using Clustering Methods
L. Torgo and C. Soares
Abstract.
This paper describes a methodology for the application of hierarchical clustering
methods to the task of outlier detection. The methodology is tested on the problem
of cleaning Official Statistics data. The goal is to detect erroneous foreign trade
transactions in data collected by the Portuguese Institute of Statistics (INE). These
transactions are a minority, but still they have an important impact on the statistics
produced by the institute. The detection of these rare errors is a manual, time-
consuming task. This type of tasks is usually constrained by a limited amount of
available resources. Our proposal addresses this issue by producing a ranking of
outlyingness that allows a better management of the available resources by allo-
cating them to the cases which are most different from the others and, thus, have
a higher probability of being errors. Our method is based on the output of stan-
dard agglomerative hierarchical clustering algorithms, resulting in no significant
additional computational costs. Our results show that it enables large savings by
selecting a small subset of suspicious transactions for manual inspection, which,
nevertheless, includes most of the erroneous transactions. In this study we com-
pare our proposal to a state of the art outlier ranking method (LOF) and show that
our method achieves better results on this particular application. The results of our
experiments are also competitive with previous results on the same data. Finally, the outcome of our experiments raises important questions concerning the method currently followed at INE for items with a small number of transactions.
Introduction
This paper addresses the problem of detecting errors in foreign trade data (INTRASTAT)
collected by the Portuguese Institute of Statistics (INE). The objective is to identify the
transactions that are most likely to contain errors. The selected transactions will then be
manually analyzed by specialized staff and corrected if an error really exists. The effort required for manual analysis ranges from simply checking the form that was submitted to further contacts with the company that made the transaction to confirm whether the values declared are the correct ones. In any case, the process requires the involvement of expensive human resources and has significant costs to INE.
1 Corresponding Author: Luis Torgo, LIAAD/INESC Porto L.A., Rua de Ceuta, 118, 6., 4050-190 Porto,
Selected transactions are usually the ones with relatively high/low values because
these affect the official statistics that are published by INE the most. Therefore, this
can be cast as an outlier detection problem. The goal is to detect as many of the errors
as possible. However, this task is constrained by the existence of a limited amount of
expensive human resources for the manual detection of errors. Additionally, the amount
of human resources available for the task varies. In busier periods, these resources have to
dedicate less time to this analysis while in quieter times they can do it in a more thorough
way. These constraints pose interesting challenges to outlier-detection methods. Many
of the methods for these detection tasks provide yes/no answers. We claim that this type
of answers leads to sub-optimal decisions when it comes to manually inspecting the
signalled cases. In effect, if the resources are limited we may well get more signals than
we can inspect. In this case, an arbitrary decision must be made to decide which cases are
to be inspected. By providing a rank of outlyingness instead, the resources can be used
on the cases that have a higher probability of error. This problem occurs in many other
applications, namely in fraud detection tasks.
Previous work on this problem has compared outlier detection methods, a decision
tree induction algorithm and a clustering method [1]. The results obtained with the latter
did not achieve the minimum goals that were established by the domain experts, and,
thus, the approach was dropped. Loureiro et al. [2] have investigated more thoroughly the
use of clustering methods to address this problem, achieving a significant boost in terms
of results. Torgo [3] has recently proposed an improvement of the method described
in [2] to obtain degrees of outlyingness. In this work we apply the method proposed by
Torgo [3] to the INE INTRASTAT data and compare it to other alternatives.
Our method uses hierarchical clustering methods to find clusters with few transac-
tions that are expected to contain observations that are significantly different from the
vast majority of the transactions. Rankings of outlyingness are obtained by exploring the
information resulting from agglomerative hierarchical clustering methods.
Our experiments with the INTRASTAT data show that our proposal is competitive
with previous approaches and also with alternative outlier ranking methods.
Section 1 describes the problem being tackled in more detail as well as the results
obtained previously on this application. We then describe our proposal in Section 2.
Section 3 presents the experimental evaluation of our method and discusses the results
we have obtained. In Section 4 we relate our work with others and finally we present the
main conclusions of this paper in Section 5.
1. Background
In this section we describe the general background, including the problem (Section 1.1)
and previous results (Section 1.2), that provide the motivation for this work.
1.1. The Problem

Each transaction is described by the following fields:
• Item id,
• Weight of the traded goods,
• Total cost,
• Type (import/export),
• Source, indicating whether the form was submitted using the digital or paper ver-
sion of the form,
• Form id,
• Company id,
• Stock number,
• Month,
• Destination or source country, depending on whether the type is export or import,
respectively.
At INE, the data are inserted into a database. Figure 1 presents an excerpt of a report
produced with data concerning import transactions from 1998 of item with id 101, as
indicated by the field labeled “NC”, below the row with the column names.2
Figure 1. An excerpt of the INTRASTAT database. The data were modified to preserve confidentiality.
Errors often occur in the process of filling forms. For instance, an incorrectly intro-
duced item id will associate a transaction with the wrong item. Another common mis-
take is caused by the use of incorrect units like, for instance, declaring the weight in tons
instead of kilos. Some of these errors have no effect on the final statistics while others
can affect them significantly.
The number of transactions declared monthly is in the order of tens of thousands.
When all of the transactions relative to a month have been entered into the database,
they are manually verified with the aim of detecting and correcting as many errors as
possible. In this search, the experts try to detect unusual values on a few attributes. One
of these attributes is Cost/Weight, which represents the cost per kilo and is calculated
using the values in the Weight and Cost columns. In Figure 1 we can see that the values
for Cost/Weight in the second and last transactions are much lower than in the others.
The corresponding forms were analyzed and it was concluded that the second transaction
is, in fact, wrong, due to the weight being given in grams rather than kilos, while the last
one is correct.
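For illustration, the derived attribute can be computed as in the following R sketch (the column names are assumptions, not the actual INTRASTAT schema); a weight declared in grams instead of kilos makes the ratio collapse towards zero.

```r
# Illustrative sketch of the Cost/Weight attribute inspected by the experts
transactions <- data.frame(
  item   = c(101, 101, 101),
  weight = c(500, 500000, 480),    # the second weight was declared in grams
  cost   = c(75000, 74000, 72000)  # values in PTE
)
transactions$cost_per_kg <- transactions$cost / transactions$weight
transactions                       # the erroneous row has a cost/weight close to zero
```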
2 Note that, in 1998, the Portuguese currency was the escudo, PTE.
The goal of this project is to reduce the time spent on this task by automatically
selecting a subset of the transactions that includes almost all the errors that the experts
would detect by looking at all the transactions. According to INE experts, to be mini-
mally acceptable the system should select less than 50% of the transactions containing
at least 90% of the errors. However, as stated earlier, given that human resources are
quite expensive, the smaller the number of transactions, the better. Additionally, the same
people are involved in other tasks in INE and sometimes are not available to evaluate
INTRASTAT transactions. Therefore, the number of transactions that can be manually
analyzed varies over different months.
Finally, we note that computational efficiency is not important because the automatic
system will hardly take longer than half the time the human expert does.
1.2. Previous Results

Different approaches were tried on this problem. Several months' worth of transactions
from 1998 and 1999 were used. The data were provided in the form of two files per
month, one with the transactions before being analyzed and corrected by the experts, and
the other obtained after that process. The integration of the information from the two files
proved much harder than could be expected. Some of the problems found were:
• difficulty in determining the primary key of the tables, even with the help of the
experts;
• some transactions existed in one of the files but not in the other;
• incomplete information, sometimes because it was not filled in the forms, in other
cases due to the reporting software (e.g., values below a given threshold were consid-
ered too low and not printed in the report).
Some of the problems were handled by eliminating the corresponding records, while oth-
ers were simply ignored because they were not expected to affect the data significantly.
This meant that, as is common in data mining projects, most of the time was spent in
data preparation [4].
Four very different methods were applied. Two come from statistics and are univari-
ate techniques: box plot [5] and Fisher’s clustering algorithm [6]. The third one, Knorr
& Ng’s cell-based algorithm [7], is an outlier detection algorithm which, despite being a
multivariate method, was used only on the Cost/Weight attribute. The last is C5.0 [8], a
multivariate technique for the induction of decision trees.
Although C5.0 is not an outlier detection method, it obtained the best results. This
was achieved with an appropriate transformation of the variables and by assigning dif-
ferent costs to different errors. As a result, 92% of the errors were detected by analyzing
just 52% of the transactions. However, taking advantage of the fact that C5.0 can output
the probability of each case being an outlier, the transactions were ordered by this proba-
bility. Based on this ranking of transactions in terms of their probability of being an error,
it was possible to detect 90% of the errors by analyzing the top 40% of the transactions.
The clustering approach based on Fisher’s algorithm was selected because it finds
the optimal partition for a given number of clusters of one variable. It was applied to all
the transactions of an item, described by a single variable, Cost/Weight. The transactions
assigned to a small cluster, that is, a cluster containing significantly fewer points than
the others, were considered outliers. The distance function used was Euclidean and the
number of clusters was k = 6. A small cluster was defined as a cluster with fewer points
than half the average number of points in the k clusters. The method was applied to data
relative to two months and selected 49% of the transactions which included 75% of the
errors, which did not accomplish the goals set by the domain experts.
Further work based on clustering methods was carried out by Loureiro et al. [2],
who have proposed a new outlier detection method based on the outcome of agglomer-
ative hierarchical clustering methods. Again, this approach used the size of the resulting
clusters as indicators of the presence of outliers. The basic assumption was that outlier
observations, being observations with unusual values, would be distant (in terms of the
metric used for clustering) from the “normal” and more frequent observations, and there-
fore would be isolated in smaller clusters. In [2], several settings concerning the clus-
tering process were explored and experimentally evaluated on the INTRASTAT prob-
lem. The best setup met the requirements of human experts (inspecting less than 50%
of transactions enabled finding more than 90% of the errors), by detecting 94.1% of the
errors by inspecting 32.7% of the transactions. In spite of this excellent result, the main
drawback of this approach is the fact that it does not allow a control over the amount of
inspection effort we have available. For instance, if 32.7% is still too much for the hu-
man resources currently available we face the un-guided task of deciding which of these
transactions will be inspected. The work presented in this paper tries to overcome this
practical limitation.
2. Outlier Ranking Using Hierarchical Clustering

Clustering algorithms can be used to identify outliers as a side effect of the cluster-
ing process (e.g. [9]). Most clustering methods rely on a distance metric and thus can
be seen as distance-based approaches to outlier detection [7]. However, iterative meth-
ods like hierarchical clustering algorithms (e.g. [10]) can also handle regions with different
densities, the inability to do so being one of the main drawbacks of distance-based approaches. In effect, if
we take agglomerative hierarchical clustering methods, for instance, they proceed in an
iterative fashion by merging two of the current groups (which initially are formed by sin-
gle observations) based on some criterion that is related to their proximity. This decision
is taken locally, that is for each pair of groups, and takes into account the density of these
two groups only. This merging process results in a tree-based structure usually known as
a dendrogram. The merging step is guided by the information contained in the distance
matrix of all available data. Several methods can be used to select the two groups to be
merged at each stage. Contrary to other clustering approaches, hierarchical methods do
not require a cluster initialization process that would inevitably spread the outliers across
many different clusters thus probably leading to a rather unstable approach. Based on
these observations we have explored hierarchical clustering methods for detecting both
local and global outliers [2].
In this paper we present an approach that takes advantage of the dendrogram gen-
erated by hierarchical clustering methods to produce a ranking of outlyingness. This ap-
proach was first described in [3] and is also based on agglomerative clustering methods.
Informally, the idea behind our proposal is to use the height (in the dendrogram) at which
any observation is merged into a group of observations as an indicator of its outlying-
ness. If an observation is really an outlier this should only occur at later stages of the
merging process, that is the observation should be merged at a higher level than “normal”
observations. More formally, we set the outlyingness factor of any observation as,
\[ OF_H(x) = \frac{h}{N} \qquad (1) \]
where h is the level of the hierarchy H at which the case is merged,3 and N is the number
of training cases (which is also the maximum level of the hierarchy by definition of the
hierarchical clustering process).
One of the main advantages of our proposal is that we can use a standard hierarchi-
cal clustering algorithm to obtain the OFH values without any additional computational
cost. This means our proposal has a time complexity of O(N 2 ) and a space complex-
ity of O(N ) [11]. We use the hclust() function of the statistical software environment
R [12], which is based on Fortran code by F. Murtagh [13]. This function includes in its
output a matrix (called merge) that can be used to easily obtain the necessary values for
calculating the value of OFH directly according to Equation 1.
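To make this concrete, the following R sketch derives the OFH scores of Equation 1 from the merge matrix returned by hclust(); the function name and default settings are ours, chosen only for illustration.

```r
# Sketch: OF_H of Equation (1) computed from the output of hclust().
# Each observation's score is the merging step at which it is individually
# merged into a group, divided by the number of cases N.
ofh <- function(x, method = "average", metric = "euclidean") {
  d <- dist(x, method = metric)
  hc <- hclust(d, method = method)
  N <- attr(d, "Size")
  of <- numeric(N)
  for (i in seq_len(nrow(hc$merge))) {
    singletons <- -hc$merge[i, ][hc$merge[i, ] < 0]  # observations merged at step i
    of[singletons] <- i / N
  }
  of
}

# Example: rank a one-dimensional sample by outlyingness (highest scores first)
scores <- ofh(c(1.10, 1.15, 1.20, 1.30, 9.70))
order(scores, decreasing = TRUE)
```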
Figure 2.(a) shows an artificial data set with two marked clusters of observations
with very different density. As it can be observed there are two clear outliers: observa-
tions 1 and 12. While the former can be seen as a global outlier, the latter is clearly a
local outlier. In effect, it is only regarded as an outlier because of the high density of its
neighbors, as it is, in effect, nearer to observation 2 than, say, the 14th is to the 15th. How-
ever, as these latter two are in a less compact region their distance is not regarded as a
signal of outlyingness. This is a clear example of a data set with both global and local
outliers and we would like our method to clearly signal both 1 and 12 as observations
with a high probability of being outliers.
Figure 2.(b) shows the dendrogram obtained by using an agglomerative hierarchical
clustering algorithm. As it can be seen, both 1 and 12 are the last observations to be
individually merged into some cluster. As such, it does not come as a surprise that when
running our method on this data we get the top 5 outliers shown in Table 1.
In spite of this success, this method has serious problems when facing compact
groups of outliers. In effect, if we have a data set where there are a few outliers that
3 Counting from bottom up.
Figure 2. (a) An artificial data set with two groups of observations with very different density (axes x and y); (b) the dendrogram obtained with an agglomerative hierarchical clustering algorithm (Height by Cases).
are very similar to each other, they will be merged with each other very quickly (i.e.,
at a low level of the hierarchy) and thus will have a very low OFH value despite being
outliers. Figure 3 illustrates the problem. For this data set, the method ranks observations
9 and 10, which are clear outliers, as the least probable outliers (they are in effect the
first to be merged). This problem is particularly important in our application and also in
fraud detection. In both cases, it is often true that the interesting observations are not
completely isolated from all the others. They sometimes stem from a behavior which,
although rare, is systematic (e.g., a company always declares transactions in counts rather
than in kilos).
Figure 3. An artificial data set containing a small, compact group of outliers (observations 9 and 10) and the corresponding dendrogram (Height by Cases).
The example of Figure 3 shows a clear failure of our initial proposal. The failure
results from considering only the height at which individual observations are merged and
not groups of observations. When there is a small group of similar observations that is
quite different from others, such that it could make sense to talk about a set of outliers,
they will only be merged with other groups at later stages but they will merge with each
other very early in the process. Therefore, our proposal will not consider this as a sign of
outlyingness of the members of that group. Still, the general idea of our proposal remains
valid so we need to generalize it for these situations. We can do this by assigning a value
similar to that of Equation 1 to all members of the smallest group of any merge that
Table 2. Outlier ranking for the example of Figure 3 using our new proposal.
Rank CaseID OFH
1 9 0.8100
2 10 0.8100
3 11 0.8075
4 15 0.6300
5 16 0.6300
occurs along the hierarchical clustering process. However, we should reinforce this value
with some size-dependent factor (i.e., the smaller the group, the more probable that its
elements are outliers). Formally, for each merge of a group g_s with a group g_l, where
|g_s| < |g_l|, we set the outlier factor of the members of g_s as,
\[
OF(g_s) =
\begin{cases}
0 & \text{if } |g_s| > t \\
\left(1 - \dfrac{|g_s|}{N}\right) \times \dfrac{h}{N} & \text{if } |g_s| \le t
\end{cases}
\qquad (2)
\]
where |g_s| is the cardinality of the smaller group, g_s, t is a threshold that indicates the
number of observations above which a group cannot be regarded as a set of outliers for
the data set, and h is the level of the hierarchy where the merge occurs. The OF value
of the larger group g_l is set to zero. The value of OF ranges from zero to one, and it is
maximum when a single observation is merged at the last level of the hierarchy.
Any observation can belong to several groups along its upwards path through the
dendrogram. As such, it will probably get several of these scores at different levels. We
set the outlyingness factor of any observation as the maximum OF score it got along
its path through the dendrogram. By proceeding this way we are in effect enabling the
method to detect local outliers, which at some merging stage might have got a very high
score of OF because they are clear outliers with respect to some group that they have
merged with, even though at higher levels of the hierarchy (i.e., seen more globally),
they might not get such high OF values. This means that the outlyingness factor of an
observation is given by
\[ OF_H(x) = \max_{g \,:\, x \in g} OF(g) \]
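As a minimal sketch (with illustrative names, and assuming the threshold t of Equation 2), these generalized scores can also be obtained directly from the merge matrix of hclust(): at each merge the smaller group is scored with Equation 2 and each observation keeps the maximum score found along its path through the dendrogram.

```r
# Sketch of the generalized outlier factor based on Equation (2)
of_groups <- function(x, t = 10, method = "average", metric = "euclidean") {
  d <- dist(x, method = metric)
  hc <- hclust(d, method = method)
  N <- attr(d, "Size")
  members <- vector("list", nrow(hc$merge))  # members of the group formed at each step
  of <- numeric(N)
  for (i in seq_len(nrow(hc$merge))) {
    # resolve the two groups merged at step i into observation indices
    grp <- lapply(hc$merge[i, ], function(m) if (m < 0) -m else members[[m]])
    members[[i]] <- unlist(grp)
    sizes <- lengths(grp)
    s <- which.min(sizes)                     # the smaller of the two groups
    if (sizes[s] <= t) {
      score <- (1 - sizes[s] / N) * (i / N)   # Equation (2)
      of[grp[[s]]] <- pmax(of[grp[[s]]], score)  # keep the maximum score per observation
    }
  }
  of
}
```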
3. Experimental Evaluation
This section describes a series of experiments designed with the goal of checking the
performance of our method on the INTRASTAT data set. We have compared our OFH
method with our previous approach [2] and also with the state of the art in terms of
obtaining degrees of outlyingness: the LOF method [14].
The INTRASTAT data set has some particularities that lead to an experimental
methodology that incorporates some of the experts’ domain knowledge so that the
methodology better meets their requirements.
We start by describing the measures used to assess the quality of the results (Sec-
tion 3.1), then we discuss the experimental setup (Section 3.2), the algorithms that were
tested (Section 3.3) and finally we discuss the results (Section 3.4).
3.1. Evaluation Measures

In order to evaluate the validity of the resulting methodology we have taken advantage
of the fact that the data set given to us had some information concerning erroneous trans-
actions. In effect, all transactions that were inspected by the experts and were found to
contain errors, were labeled as such. Taking advantage of this information we were able
to evaluate the performance of our methodology in tagging these erroneous transactions
for manual inspection. The experts were particularly interested in two measures of per-
formance: Recall and Percentage of Selected Transactions, which are discussed next.
Recall (%R) can be informally defined in the context of this domain as the propor-
tion of erroneous transactions (as labeled by the experts) that are selected by our models
for manual inspection. Ideally, our models should select a set of transactions for manual
inspection that includes all the transactions that were previously labeled by the experts
as errors. However, taking into consideration the difficulty of the problem, INE experts
established the value of 90% as the minimum acceptable recall.
Regarding the percentage of selected transactions (%S) this is the proportion of
all the transactions that are selected for manual inspection by the models. This statistic
quantifies the savings in human resources achieved by using the methodology: the lower
this value the more manual effort is saved. INE experts defined 50% as the maximum
admissible value for this statistic. Given the fact that our method outputs a ranking of
outlyingness we can easily control the value of this measure. The user can decide which
percentage of transactions he/she wants to check and then use the ranking provided by
our method to select the transactions corresponding to the selected percentage. Given that
50% is the maximum value and that it is important to release human resources for other
tasks and that the available resources vary, in our experimental evaluation, we have col-
lected results for four different percentages of selected transactions: 35%, 40%, 45% and
50%. All of these settings satisfy the requirements established by the experts concerning
this measure.
An important issue that must be taken into account when analyzing the value of
recall (%R) is the quality of the labels assigned to transactions. When a transaction
is labeled as an error, this classification is reliable because it means that the experts
have analyzed the transaction and found an error. However, since not all transactions are
analyzed, there may be some that are labeled as “normal" but are, in fact, errors. Many
of these are transactions that were actually detected by the experts but, because they are
not expected to affect the trade statistics which are computed based on these data, are not
corrected. However, it is possible that some significant errors are missed by the experts.
Here, we will not address this issue and simply focus on selecting the errors that were
detected by the domain experts.
Table 3. The “base” results obtained just by including the items with less than 10 transactions.
3.2. Experimental Setup

According to INE experts, the items should be inspected separately due to the rather
diverse distribution of the prices of the products. For instance, the variation of values for
rice is smaller than for heavy machinery. As such we have applied our algorithm to the
set of transactions of each item in turn.
Our outlier ranking method is designed for multivariate analysis. However, follow-
ing another suggestion from the domain experts we have focused our study of the IN-
TRASTAT data set on a single variable, Cost/Weight. Domain experts give particular
attention to this variable as they believe it is the most efficient variable for detecting the
important errors.
Given that INE processes the data on a monthly basis we have decided to use this
very same logic in our tests. This methodology will also enable us to compare our results
with the results obtained in [1], where the same strategy was followed.
One final constraint has an important effect on the results. According to INE ex-
perts, all items with very few transactions, referred to as infrequent items, must be set
for manual inspection. This reduces the number of transactions that the outlier detection
methods may, in fact, select. The domain experts defined 10 as the minimum number
of transactions required for an item not to be classified as infrequent. As shown in Table 3,
this fact alone has a big impact on the process. The number of transactions that can be
selected by the outlier detection method is not 50%, as originally established, but ranges
from 15% to 35% (approx.). Furthermore, the concentration of errors in the infrequent
items is generally higher than in the others, but not that much higher. In the selected
transactions, the number of errors found represents between 25% and 40% (approx.) of
all the errors. Considering, for instance, the month of Jan/1998, the items with less than
10 transactions represent 35.7% of all the transactions and contain 35.4% of the errors.
This means that, to achieve the target of 90% Recall, the outlier detection method
needs to find almost 55% of the errors by selecting less than 15% of the transactions, to
stay within the maximum effort tolerated by INE experts, which is 50%.
The experimental methodology that we have used is better described by Algorithm 1.
This algorithm calculates the value of the Recall for each month of the testing period,
given a certain desired human effort (given by a provided %S).
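A rough R sketch of this evaluation loop is given below; the column names (month, item, cost_per_kg, error) and the of_groups() helper from Section 2 are assumptions used only to illustrate the logic of Algorithm 1 (infrequent items are fully selected, the remaining transactions are ranked by OF_H, and the top %S is inspected).

```r
# Rough sketch of the evaluation loop of Algorithm 1 (illustrative column names)
monthly_recall <- function(D, PercS, t = 10) {
  sapply(split(D, D$month), function(Dm) {
    Dm$score <- Inf                                   # infrequent items come first
    for (it in unique(Dm$item)) {
      idx <- which(Dm$item == it)
      if (length(idx) >= 10)
        Dm$score[idx] <- of_groups(Dm$cost_per_kg[idx], t = t)
    }
    n_sel <- ceiling(PercS * nrow(Dm))
    selected <- order(Dm$score, decreasing = TRUE)[seq_len(n_sel)]
    sum(Dm$error[selected]) / sum(Dm$error)           # %R for this month
  })
}

# Example call for a 45% inspection effort (D would hold the INTRASTAT transactions):
# recalls <- monthly_recall(D, PercS = 0.45)
```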
3.3. Algorithms
Using Algorithm 1 we have collected the performance, in terms of Recall, of our pro-
posed method and also of the LOF method.
The clustering-based outlier detection method proposed here (Section 2) has several
parameters. The first is the agglomeration method used with the hclust() function. In
our experiments we have tested several alternatives: the ward, single, complete, average,
mcquitty, median and centroid methods. Another parameter of our method is the distance
Algorithm 1. Require: D, PercS (D is the data set, PercS is the %S selected by the user). Ensure: %R (the vector of %R values for each month).
function used. For this parameter we have experimented with both the euclidean and
canberra functions. Finally, our method also requires the specification of a limit on the
size of a group in order to be selected as a group of (potential) outliers (the t threshold
in Equation 2). The possible combinations of these settings make up a total of 14
variants of our method.
With respect to LOF, we have used the implementation of this algorithm that is
available in the R package dprep [15]. We have also experimented with 14 variants of
this method, namely by varying the number of neighbours used by the method from 2 to
28 in steps of 2.
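For reference, a sketch of how these LOF variants could be obtained, assuming the lofactor() function of the dprep package (the data below are purely illustrative):

```r
# Sketch: LOF scores for neighbourhood sizes 2, 4, ..., 28 (assuming dprep::lofactor)
library(dprep)
x <- matrix(rnorm(200), ncol = 1)                     # e.g., Cost/Weight values of one item
lof_variants <- lapply(seq(2, 28, by = 2), function(k) lofactor(x, k))
```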
In our graphs of results we also plot the %S and %R value of the method described
in [2], which is denoted in the graphs as “LTS04”. This method is not an outlier ranking
algorithm. It simply outputs the (unordered) set of transactions it judges as being outliers,
which leads to a single pair of %S and %R values. In this case the user is not able to
adjust the %S value to the available resources. By chance, in none of the testing months
was the 50% limit of selected transactions surpassed, but with this type of method there
is no such guarantee. In months when the available resources are not sufficient to analyze
all the transactions selected, the experts must decide which ones to set aside. Additionally,
in the months when the number of transactions that could be analyzed by the available
resources is greater than the number of selected transactions, the experts must arbitrarily
Figure 4. Percentage of selected transactions (%S) versus recall (%R) for each of the 8 testing months.
select further transactions to check. For the “LTS04” method the same 14 variants used
with OFH were tried.
3.4. Results
Figure 4 shows the results of our comparative experiments in terms of recall (%R) and
percentage of selected transactions (%S) for each of the 8 available testing months. For
each of the methods we have always reported the best result of the 14 variants that were
tried. These can thus be regarded as the best possible outcome of these methods. Each
graph in the figure represents a month. All graphs have two dotted lines indicating the
experts requirements (at least 90% recall and at most 50% selected transactions). This
means that for each graph the best place to be is the bottom right corner (maximum
%R and minimum %S). Still, the most important statistic is Recall as long as we do
not exceed the 50% limit. The four points for both OFH and LOF represent the
four previously selected working points in terms of %S. Still, we should recall that both
methods would be better represented by lines as any other working points could have
been selected. Some of the points are not shown on some graphs because the respective
method achieved a very poor score that is outside of the used axes limits.
The results of our experiments (cf. Figure 4) clearly indicate that our method is
competitive with a state of the art outlier ranking method, LOF. This confirms previous
results on a different set of applications [3]. Moreover, our method is always able to
fulfil the minimum requirement of 90% recall, which is not always the case with LOF.
Compared to “LTS04”, both OFH and LOF lose a few times in terms of achieving the
same %R for the same level of %S. Still, we should recall that “LTS04” provides no
flexibility in terms of available human resources and thus it can happen (as for instance
in Jun/1998) that the solution provided by this method does not attain the objectives of
the experts or even that it is not feasible because it requires too many resources.
As discussed in Section 3.2, the results presented in Figure 4 include all transactions
from infrequent items, i.e, items with less than 10 transactions. An analysis of Figure 4
taking into account the impact of infrequent items (cf. Table 3), raises an important ques-
tion. In effect, the decision of inspecting infrequent items was “imposed” by the INE
experts. However, by looking at our results we think this decision is rather question-
able. For instance, in Jan/1998 the inclusion of the small items incurred a “cost” of
%S = 35.7%, whilst only allowing us to detect 35.4% of the errors. By simply adding
10% more transactions, our method (the OFH variant with %S = 45%) was able to boost the recall to 95%. Now
the question is: is it really necessary to analyze all the transactions in infrequent items?
The small amount of data makes the outlier detection method proposed here inappro-
priate for these items. However, it may be possible to use some other form of statistical
decision method to reduce the number of transactions from infrequent items to analyze.
Our results clearly indicate that statistical-based outlier detection methods are able to do
a much better job than this brute force approach. Therefore, if we can reduce the amount
of effort required for infrequent items, then more resources can be dedicated to analyzing
transactions selected by the outlier detection method proposed here.
A lesson that can be learned from this observation is that not all domain-specific
knowledge is useful. However, addressing the problem of using automatic methods to
select transactions in infrequent items is not just a technical challenge, caused by the
small volume of data. If we are able to successfully detect outliers in these items, the next
challenge will be to convince the experts to change their beliefs.
4. Related Work
Outlier detection is a well studied topic (e.g. [16]). Different approaches have been taken
to address this task. Distribution-based approaches (e.g. [17,18]) assume a certain para-
metric distribution of the data and signal outliers as observations that deviate from this
distribution. The main drawbacks of these approaches lie on the constraints of the as-
sumed distributions. Depth-based methods (e.g. [19]) are based on computational ge-
ometry and compute different layers of k-d convex hulls and then represent each data
point in this space together with an assigned depth. In practice these methods are too
inefficient for dealing with large data sets. Knorr and Ng [7] introduced distance-based
outlier detection methods. These approaches generalize several notions of distribution-
based methods but still suffer from several problems, namely when the density of the data
points varies (e.g. [14]). Density-based local outliers [20,14] are able to find this type of
outliers and are the appropriate setup whenever we have a data set with a complex distri-
bution structure. These authors defined the notion of Local Outlier Factor (LOF) for each
observation, which naturally leads to the notion of outlier ranking. The key idea of this
work is that the notion of outlier should be “local” in the sense that the outlier degree of
any observation should be determined by the clustering structure in a bounded neighbor-
hood of the observation. In Section 3 we have seen that our method compares favorably
with the LOF algorithm on the problem of detecting errors in Portuguese foreign trade
transactions.
Other authors have looked at the problem of outliers from a supervised learning per-
spective (e.g. [1,21]). Usually, the goal of these approaches is to classify a given obser-
vation as being an outlier or as a “normal” case. These approaches are typically affected
by the problem of unbalanced classes that occurs in outlier detection applications, be-
cause outliers are, by definition, much less frequent than the “normal" observations. If
adequate adjustments are not made, this kind of class distribution usually deteriorates the
performance of the supervised models [22].
5. Conclusions
In this paper we have presented a method for obtaining a ranking of outlyingness using a
hierarchical clustering approach. This method uses the height at which cases are merged
in the clustering process as the key factor for obtaining a degree of outlyingness.
We have applied our methodology to the task of detecting erroneous foreign trade
transactions in data collected by the Portuguese Institute of Statistics (INE). The results
of the application of our method to this problem clearly met the performance criteria
outlined by the human experts. Moreover, our results outperform previous approaches to
this same problem. Compared to these previous approaches, our method provides a result
that allows a flexible management of the available human resources for the manual task
of inspecting the potential erroneous transactions.
Our results have also revealed a potential inefficiency in the process used by INE to
handle the items with a small number of transactions. In future work we plan to address
these items in a way that we expect to further improve our current results.
Acknowledgements
This work was partially funded by FCT projects oRANKI (PTDC/EIA/68322/2006) and
Rank! (PTDC/EIA/81178/2006) and by a sabbatical grant from the Portuguese govern-
ment to L. Torgo. We would like to thank INE for providing the data used in this study.
References
[1] C. Soares, P. Brazdil, J. Costa, V. Cortez, and A. Carvalho. Error detection in foreign trade data using
statistical and machine learning methods. In N. Mackin, editor, Proc. of the 3rd International Conference
on the Practical Applications of Knowledge Discovery and Data Mining, pages 183–188, 1999.
[2] A. Loureiro, L. Torgo, and C. Soares. Outlier detection using clustering methods: a data cleaning ap-
plication. In Malerba D. and May M., editors, Proceedings of KDNet Symposium on Knowledge-based
Systems for the Public Sector, 2004.
[3] L. Torgo. Resource-bounded fraud detection. In Neves et. al, editor, Proceedings of the 13th Portuguese
Conference on Artificial Intelligence (EPIA’07), LNAI, pages 449–460. Springer, 2007.
[4] Shichao Zhang, Chengqi Zhang, and Qiang Yang. Data preparation for data mining. Applied
Artificial Intelligence, 17(5&6):375–381, May 2003.
[5] J.S. Milton, P.M. McTeer, and J.J. Corbet. Introduction to Statistics. McGraw-Hill, 1997.
[6] W.D. Fisher. On grouping for maximum homogeneity. Journal of the American Statistical Association,
53:789–798, 1958.
[7] Edwin M. Knorr and Raymond T. Ng. Algorithms for mining distance-based outliers in large datasets. In
Proceedings of 24rd International Conference on Very Large Data Bases (VLDB 1998), pages 392–403.
Morgan Kaufmann, 1998.
[8] R. Quinlan. C5.0: An Informal Tutorial. RuleQuest, 1998.
http://www.rulequest.com/see5-unix.html.
[9] R. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proc. of VLDB’94,
1994.
[10] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley,
New York, 1990.
[11] F. Murtagh. Complexities of hierarchic clustering algorithms: state of the art. Computational Statistics
Quarterly, 1:101–113, 1984.
[12] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, 2008. ISBN 3-900051-07-0.
Data Mining for Business Applications
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-99

An Integrated System to Support Electricity Tariff Contract Definition
F. Rodrigues et al.
Abstract. This paper presents an integrated system that helps both retail companies
and electricity consumers in the definition of the best retail contracts and tariffs.
This integrated system is composed of a Decision Support System (DSS) based
on a Consumer Characterization Framework (CCF). The CCF is based on data
mining techniques, applied to obtain useful knowledge about electricity consumers
from large amounts of consumption data. This knowledge is acquired following an
innovative and systematic approach able to identify different consumer classes,
each represented by a load profile, and to characterize them using decision trees. The
framework generates inputs to use in the knowledge base and in the database of the
DSS. The rule sets derived from the decision trees are integrated in the knowledge
base of the DSS. The load profiles together with the information about contracts
and electricity prices form the database of the DSS. This DSS is able to perform
the classification of different consumers, present their load profiles and test different
electricity tariffs and contracts. The final outputs of the DSS are a comparative
economic analysis between different contracts and advice about the most economic
contract for each consumer class. The presentation of the DSS is completed with
an application example using a real database of consumers from the Portuguese
distribution company.
Introduction
The full liberalization of most of the electricity markets in Europe and around the world
creates a new environment where several retail companies compete for the electricity sup-
ply of end users. According to [1] the development of decision support tools is of major
importance to show consumers the potential savings they can get by assuming a more
active participation in the electricity markets. Also a company with the opportunity to
trade energy to consumers, in a competitive scenario, must make many decisions and
evaluations to attain the best tariff structure and to study the portfolio of contracts to
offer to consumers. Making these decisions and correctly evaluating the options is a
time consuming process. Another important point is the progressive replacement of tra-
ditional electricity meters by real-time meters: the amount of data collected will grow
in an exponential manner. The development of frameworks and tools able to extract use-
ful knowledge from these huge volumes of data and use it in decision support will be a
competitive advantage for the electricity retailers and an important step towards a more ac-
tive demand side participation.
1 Contact author: School of Engineering, Polytechnic Institute of Porto, Rua Dr António Bernardino de
The freedom to define a larger portfolio of contracts by
retail companies, and the freedom of consumers to choose between different contracts
and different companies, increases the need for decision support tools to help both sides.
This scenario motivates the development of a decision tool for the selection of the best
electricity retail contract, which is presented in this paper. It is possible to find in the lit-
erature some previous works dedicated to this problem. In [2] data mining techniques are
applied to the problem of load profiling. In [3] a load research project followed by load
profiling is presented and the results of this work are used to support tariff definition. In
[4] consumer classes and their load profiles are defined by clustering techniques and the
results are used to study different contracts for producers. In [5], a framework for the
automatic classification and characterization of electricity consumers is presented, which
is able to deal with large amounts of data and perform the classification of different con-
sumers according to their load profiles. This paper is organized as follows: in section 1, a
description of consumer characterization framework is made, section 2 presents the data
mining module, in section 3, the DSS is described, and in section 4, a practical example
is presented. Finally, in section 5 we present the conclusions.
1. The Consumer Characterization Framework

The knowledge about how and when consumers use electricity is essential to develop an
efficient DSS. This knowledge must be obtained from historical data and must be up-
dated to follow the changes in consumers' behavior. To generate this kind of knowledge
and keep it regularly updated, a comprehensive methodology was developed. Due to the
large amount of data predicted to be available in the future and the need for easy updat-
ing, the CCF provides a clear separation of different steps that include various Data Min-
ing techniques. The proposed framework is based on the study of previous load profiling
projects [6,7] and on the structure of the KDD process [8].
In the cleaning phase we check for inconsistencies in the data and outliers are removed.
Anomalous consumption values and outages are detected and replaced based on the in-
formation of similar days. This type of error represents 1% of the total data. In the prepro-
cessing phase missing values are detected and replaced using regression techniques. Lin-
ear regression is used to estimate numerical attributes like missing values of measures,
and logistic regression is used to estimate nominal attributes like the missing commercial
information (about 6% of the data), such as activity type and tariff type. The regression models
allow the substitution of values with 95% confidence. With this procedure the
major problems encountered are minimized and the initial data set is clean and complete.
Next, we divide the data into subsets. This is done using prior knowledge about how the load-
ing conditions, like the season of the year and the type of weekday (working days or
weekends), affect electricity consumption. Data is separated, according to the different
loading conditions, in smaller data sets. We have two data sets representing each season
of the year, one for working days and another for weekends. To obtain a more effective
data reduction, without losing important information, the data from each individual con-
sumer is reduced. This is based on the reduction of the measured daily load diagrams,
corresponding to each loading condition, to one representative load diagram. These rep-
resentative load diagrams are obtained by processing the data from the measurement cam-
paign. For each consumer, the representative load diagram is built by averaging the mea-
sured load diagrams using a spreadsheet. Each consumer is then described by one single
representative load diagram in each data set, for the different loading conditions (see fig-
ure 1). The diagrams are computed using the field-measurement values, so they need to
be brought to a similar scale for the purpose of pattern comparison. This is
achieved through normalization. For each consumer the vector of the representative load
diagram was normalized to the [0-1] range using the peak power of the representative
load diagram. This kind of normalization allows maintaining the shape of the curve and
permits comparing the consumption patterns.
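The following base-R sketch illustrates this reduction and normalization step for one loading condition; the data frame and its column names are illustrative, not the actual measurement campaign format.

```r
# Sketch: one representative (average) daily diagram per consumer, normalized by peak power
readings <- expand.grid(consumer = 1:50, day = 1:20, interval = 1:96)
readings$power <- runif(nrow(readings), 0, 10)        # simulated 15-minute measurements

# average the measured diagrams of each consumer into a 96-value representative diagram
rep_diag <- with(readings, tapply(power, list(consumer, interval), mean))

# normalize each diagram to the [0, 1] range using its own peak power
rep_diag_norm <- rep_diag / apply(rep_diag, 1, max)
```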
The application of data mining techniques is done using one isolated technique or by com-
bining several techniques, to build models able to find relevant knowledge about the dif-
ferent consumption patterns found in the data. The implementation of the models involves
several steps, like attribute selection, fitting the models to the data and evaluating the
models. This will be described in the next section.
2. The Data Mining Module

The data mining module for consumer characterization is based on the combination of
unsupervised and supervised learning techniques. After the data pre-processing and re-
duction phase, each consumer is described by its representative load diagram and the
commercial indexes used by the distribution company. The representative daily load di-
agram of the m-th consumer is the vector l^(m) = {l_1^(m), ..., l_H^(m)}, where l_h^(m) is the
normalized value of the instant power consumed in instant h, h = 1, ..., H, with
H = 96, representing the 15-minute intervals between the collected measurements.
The commercial indexes available are of contractual nature (i.e., activity type, con-
tracted power, tariff type, supply voltage level). The distribution company, to classify its
clients, defines these indexes a priori.
The proposed module is divided into two main sub-modules according to the task they ad-
dress: segmentation and targeting. In the first sub-module unsupervised learning, based
on clustering techniques, is used to obtain a partition of the initial sample into a set of
consumer clusters. These clusters represent the different consumption patterns existing
in the available sample. Each of these clusters is represented by its load profile. In the
second sub-module, supervised learning (using decision trees) is used to describe each
cluster by a rule set and create a classification model able to assign consumers to the
existing clusters. The first sub-module is important for the determination and updating
of the load profiles, and the classification model is important, as new data is collected,
for the assignment of new consumers to the existing consumer classes.
The load profiling sub-module's goal is the partition of the initial data sample into a set of
clusters defined according to the load shape of the representative load diagrams of each
consumer. This is done by assigning to the same cluster consumers with the most similar
behavior, and to different clusters consumers with dissimilar behavior. The first step of
the module development was the selection of the most suitable attributes to be used by
the clustering model. To obtain the best separation between the classes it is important
to use the most detailed information about the shape of the consumers’ load diagrams.
The vectors with the normalized representative load diagrams are the best option. The
number of clusters is an input of the model so it must be defined based on a criterion that
leads to an adequate selection. The number of clusters obtained by the clustering module
was defined by the electricity company, which determined a minimum number of 6
and a maximum number of 9 classes. To define the number of classes, several clustering
operations were performed to study the evolution of the clusters compactness using the
measure Mean Index Adequacy (MIA) presented in [4]. The following distances 1 and 2
are defined to assist the formulation of the adequacy measure:
1. Distance between two load diagrams:
\[ d(l_i, l_j) = \sqrt{\frac{1}{H}\sum_{h=1}^{H}\left(l_i(h) - l_j(h)\right)^2} \qquad (1) \]
2. Distance between a representative load diagram and the center of a set of dia-
grams, defined as the geometric mean distance between r^(k) and each element l^(m):
\[ d(r^{(k)}, L^{(k)}) = \sqrt{\frac{1}{n^{(k)}}\sum_{m=1}^{n^{(k)}} d(r^{(k)}, l^{(m)})^2} \qquad (2) \]
Let us consider a set of M load diagrams separated into K clusters, k = 1, ..., K, where
K is the total number of clusters, and each cluster is formed by a subset C^(k) of load
diagrams, where r^(k) is the representative pattern assigned to cluster k. The MIA is defined by:
\[ MIA = \sqrt{\frac{1}{K}\sum_{k=1}^{K} d(r^{(k)}, C^{(k)})^2} \qquad (3) \]
The smaller values of MIA indicate more compact clusters. The k-means algorithm was
used to study the cluster tendency of the data set based on the MIA measure. The obtained
results are presented in Figure 2.
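A sketch of the MIA computation for several candidate numbers of clusters is shown below (Equations 1-3); the load diagram matrix is simulated and the helper names are ours.

```r
# Sketch of the MIA measure for a k-means clustering of normalized load diagrams
rms_dist <- function(a, b) sqrt(mean((a - b)^2))             # Equation (1)

mia <- function(L, cl) {
  K <- nrow(cl$centers)
  per_cluster <- sapply(seq_len(K), function(k) {
    members <- L[cl$cluster == k, , drop = FALSE]
    # Equation (2): RMS of the distances between the cluster center and its members
    sqrt(mean(apply(members, 1, rms_dist, b = cl$centers[k, ])^2))
  })
  sqrt(mean(per_cluster^2))                                   # Equation (3)
}

# Study cluster compactness for 6 to 9 clusters, as suggested by the company
L <- matrix(runif(200 * 96), nrow = 200)                      # illustrative data
mias <- sapply(6:9, function(k) mia(L, kmeans(L, centers = k, nstart = 10)))
```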
It is possible to see that 9 clusters would be the best choice, considering the indica-
tion of the distribution company and the evolution of the MIA, because for more than 9
clusters the improvement in cluster compactness, represented by the decrease of the
MIA values, is not very relevant.
The selection of the most suitable clustering algorithm is described in [6] and was
based on a comparative analysis of the performance of different algorithms. Several al-
gorithms were tested performing different clustering operations. The best results are ob-
tained with a combination of a self-organizing map (SOM) [9] with the classical k-means
algorithm [10]. This combination operates in two levels. In the first level the SOM is
used to obtain a reduction of the dimension of the initial data set. The SOM performs the
projection of the H-dimensional space, containing the M vectors representing the load
diagrams of the consumers in the initial data set, into a bi-dimensional space. Two co-
ordinates, representing the SOM attributes in the bi-dimensional space, are assigned to
each client. At the end of the first level the initial data set is reduced to the number of
winning units in the output layer of the SOM, represented by its weight vectors. This set
of vectors is able to keep the characteristics of the initial data set and achieve a reduction
of its dimension. In the second level the k-means algorithm is used to group the weight
vectors of the SOM’s units and the final clusters are obtained. The use of the k-means in
the second level allows the definition of the number of clusters as an input of the model.
This combination is very interesting for large data sets, very common in data mining
problems. The SOM has good performance with large data sets and is able to process
large amounts of data, reducing this data to a smaller data set. During the comparative
analysis it was possible to conclude that the k-means algorithm presented a very good
performance with data sets with continuous attributes, like the ones we are using, but this
algorithm presents limitations with large data sets. The combination of both algorithms
was able to solve these limitations and create a solution able to deal with large data sets.
Testing both solutions we were able to conclude that the results obtained were similar,
which confirms the effectiveness of the proposed combination. The load profiles for each
class are obtained by averaging the representative load diagrams of the consumers as-
signed to the same cluster.
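A possible sketch of this two-level procedure, assuming the som() and somgrid() functions of the kohonen package (version 3 or later) and an illustrative 10 x 10 grid:

```r
# Sketch of the two-level clustering: SOM followed by k-means on the SOM prototypes
library(kohonen)

L <- matrix(runif(2000 * 96), nrow = 2000)       # normalized load diagrams (M x 96)

# Level 1: project the 96-dimensional diagrams onto a small SOM grid
sm <- som(L, grid = somgrid(xdim = 10, ydim = 10, topo = "hexagonal"))

# Level 2: group the SOM prototype (weight) vectors into the final 9 clusters
proto <- sm$codes[[1]]                           # kohonen >= 3.0 stores codes as a list
km <- kmeans(proto, centers = 9, nstart = 25)

# Assign each consumer to the cluster of its winning SOM unit
consumer_class <- km$cluster[sm$unit.classif]

# Load profile of each class: average of its members' representative diagrams
profiles <- t(sapply(1:9, function(k) colMeans(L[consumer_class == k, , drop = FALSE])))
```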
Table 1. Selected load shape indexes.
Load Factor: d1 = P_av,day / P_max,day (1 day)
Night Impact: d3 = (1/3) · (P_av,night / P_av,day) (1 day; 8 night hours, from 11 p.m. to 7 a.m.)
Lunch Impact: d5 = (1/8) · (P_av,lunch / P_av,day) (1 day; 3 hours, from 12:00 to 15:00)
Several load shape indexes were proposed in [11] and we selected the most relevant ones, presented in Table 1.
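For illustration, the indexes of Table 1 can be computed from a 96-value representative diagram as in the sketch below (the diagram is assumed to start at 00:00; the function name is ours).

```r
# Sketch: load shape indexes of Table 1 from a representative daily load diagram
load_shape_indexes <- function(l) {
  stopifnot(length(l) == 96)
  hour <- rep(0:23, each = 4)                      # hour of day for each 15-minute value
  p_av_day   <- mean(l)
  p_max_day  <- max(l)
  p_av_night <- mean(l[hour >= 23 | hour < 7])     # 11 p.m. to 7 a.m.
  p_av_lunch <- mean(l[hour >= 12 & hour < 15])    # 12:00 to 15:00
  c(load_factor  = p_av_day / p_max_day,
    night_impact = (1 / 3) * p_av_night / p_av_day,
    lunch_impact = (1 / 8) * p_av_lunch / p_av_day)
}
```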
The classification module uses supervised learning, based on the knowledge about the relation between the characteristics of a consumer and its corresponding class, obtained with the clustering operation. The model's goal attribute is the consumer class obtained by the clustering module. The load shape indexes are computed for each group of consumers using the representative load diagrams. In order to reduce the range of values assumed by these indexes, and to treat them as nominal attributes, they are replaced by a small number of distinct categories using an interval equalization method [12]. This method produces intervals of different sizes, chosen so that approximately the same number of consumers falls in each one; this minimizes the loss of information caused by replacing the indexes with a set of discrete categories, since all the classes will be considered in the model. Each interval is a class label. This also allows us to treat the load shape indexes in the same manner as the commercial indexes. The classification model inputs are the commercial and the load shape indexes, for each class load profile. The classification algorithm used is C5.0 [13]. This algorithm was selected because it provides interpretable models, is adequate for nominal attributes and does not require long training
times, so it performs well with large data sets such as the ones used in data mining. The model evaluation is performed using ten-fold cross-validation. This increases the computational effort but gives a more reliable estimate of the model's accuracy, the proportion of true results (both true positives and true negatives) in the population. The classification model creates a complete characterization of the consumers' classes based on the most relevant attributes selected by the model. This model will be the knowledge base of the DSS.
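A hedged sketch of this classification step on synthetic data: the load shape indexes are discretised into equal-frequency intervals (standing in for the interval equalization method of [12]), and a CART decision tree from scikit-learn stands in for C5.0, evaluated with ten-fold cross-validation. Column names and data are placeholders.

```python
# Sketch of the classification module: equal-frequency discretisation of the load
# shape indexes, then a decision tree evaluated with ten-fold cross-validation.
# scikit-learn's CART tree is used here only because C5.0 is not freely available.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "d1": rng.random(n),                                  # load factor
    "d3": rng.random(n) * 0.8,                            # night impact
    "contracted_power": rng.choice([3.45, 6.9, 10.35, 20.7], n),  # stands in for the commercial indexes
    "cluster": rng.integers(0, 9, n),                     # class given by the clustering module
})

# Interval equalisation: bins chosen so roughly the same number of consumers falls in each.
for col in ["d1", "d3"]:
    df[col + "_cat"] = pd.qcut(df[col], q=5, labels=False)

X = df[["d1_cat", "d3_cat", "contracted_power"]]
y = df["cluster"]
tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=0)
scores = cross_val_score(tree, X, y, cv=10)               # ten-fold cross-validation
print("accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```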
The implementation of the DSS started with the definition of a clear separation between data storage, knowledge organization and the program that performs the calculations to obtain the best contract. This separation is very important for a flexible DSS able to deal with large amounts of data [14]. The DSS must easily permit the introduction of the contractual consumer characteristics and of the most relevant attributes describing the consumer's behavior. These input parameters start a decision process that ends with the presentation of the most adequate contract (the best contract). This decision is complemented by a comparative analysis of all the available options, which makes clearer both the choice of the best one and the potential gain obtained with the change. The internal structure of the DSS (Figure 4) is composed of:
• a knowledge base where the rule sets obtained by the CCF are stored;
• a database where all the information about the load profiles and the electricity tariff structures available, or being tested, in the company is gathered;
• a working memory that contains all the information about the contract that is
either supplied by the user or inferred by the system during the session;
• an inference engine that matches the facts contained in the working memory with
the knowledge contained in the knowledge base, to draw conclusions;
• a user interface to easily collect consumer characteristics as inputs, start the deci-
sion process and present the results in a clear way.
The rules that define and characterize the clients' load profiles are stored in the knowledge base. The rules created in the CCF are described by the load shape indexes, the load factor (d1) and the night impact (d3), and by the contracted power. These parameters are the data inputs the DSS needs to calculate the best contract. For new clients, this information is initially obtained by questioning them about how many and what type of electrical appliances they have, how many people live in the house, and so on. To make data manipulation by the DSS easier and more practical, we simplified the input parameters into a set of intervals. The user does not need to enter a specific value for each parameter, only a categorical value: low, medium, high, very high or ultra high. This kind of simplification was necessary to allow the practical application of the DSS to small consumers without real-time meters. Until electricity meters are replaced by real-time meters, a low-voltage (LV) consumer does not know the exact value of its load factor or night impact. It is possible, however, based on simple information about consumption habits, to predict in which of the categories presented below these indexes fall. This simplification loses some precision but permits the application of the DSS to LV consumers. The load factor and the night impact were classified into the following discretized intervals.
d1 ≥ 0.6 → ultra high        d3 ≥ 0.6 → ultra high
The resulting knowledge base is much simpler, easy to manipulate and permits the classification of low-voltage consumers. Next we present, as an example, the rule set obtained for the winter working-day classes, using a database of Portuguese consumers. Each of these classes has a different load profile that is used as the reference for the cost calculations and for observing the potential savings.
The knowledge base is composed of different rule sets corresponding to different load conditions: winter weekends and working days, and summer weekends and working days. Different calculations are performed for winter and for summer.

The information gathered in this database comprises the load profiles obtained by the CCF and the tariff structures corresponding to the different contracts to be used in the economic study. The database covers different types of contracts based on the Portuguese regulated tariffs for 2004: Fixed Rate (FR); Time-of-Use (TOU) contracts, which have different prices for peak and off-peak hours and can follow either a weekly cycle, with different plans for working days and weekends (TOU-WC), or a daily cycle, with the same plan for all week days (TOU-WDC); Tailored Contracts (TC); and Real-Time Pricing (RTP). The DSS can easily be adjusted to run test simulations of different tailored contracts with different profit levels. Besides the retailer profit, these contracts can also include an insurance factor reflecting the level of risk shared by both parties involved in the contract.
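The cost comparison at the heart of the DSS can be sketched as follows; the tariff structures mirror the contract types listed above, but every price and the load profile itself are invented placeholders, not the 2004 Portuguese regulated tariffs.

```python
# Sketch of the DSS cost comparison: given a class load profile (24 hourly kW values),
# compute the energy cost of one day under a flat rate, a time-of-use rate and a
# real-time price, and report the cheapest. All prices are invented placeholders.
import numpy as np

profile = np.array([0.4]*7 + [0.9]*10 + [1.2]*5 + [0.6]*2)     # hypothetical class load profile, kW

def flat_rate(p, price=0.14):
    return p.sum() * price

def time_of_use(p, peak_price=0.20, off_peak_price=0.09, peak_hours=range(8, 22)):
    peak = np.isin(np.arange(24), list(peak_hours))
    return p[peak].sum() * peak_price + p[~peak].sum() * off_peak_price

def real_time(p, hourly_prices):
    return float(np.dot(p, hourly_prices))

rtp_prices = 0.08 + 0.10 * np.sin(np.linspace(0, np.pi, 24))    # hypothetical hourly prices
costs = {
    "FR": flat_rate(profile),
    "TOU": time_of_use(profile),
    "RTP": real_time(profile, rtp_prices),
}
best = min(costs, key=costs.get)
print(costs, "-> best contract:", best)
```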
3.3. Interface
An interactive and easy-to-use interface was developed to allow the user to interact with the DSS. In this interface the inputs necessary to characterize each client are the load factor, the night impact and the contracted power. After this information is introduced, the DSS searches for the load profile most adequate to the client. Next, the DSS performs the calculations for the available contracts of this load profile and presents the electricity costs for the different contractual structures. After that it is possible to choose the most economical contract. Figure 4 presents the interface with a simulation case as an example.
4. Simulation Results
The DSS was tested and validated by running a large number of simulations for different possible clients, and the results obtained were very satisfactory. The DSS is flexible enough to study different types of contracts and to adjust them in a very simple way. The system is able to deal with large amounts of data and can be extended to work with real-time updating of the database. As an example, we present the results obtained by the DSS for a simulation using the following consumer characteristics:
The costs of the available contracts for this load profile are shown in Figure 4. For consumers with a high load factor (d1) and a low night impact (d3), the most economical contract is the proposed Tailored Contract (TC), followed by Real-Time Pricing (RTP). The other contracts are more expensive: TOU-WC and TOU-WDC present the same value for this type of client, and FR is the most expensive. From this example we can conclude that the electricity prices existing at the moment can be improved. The simulation was repeated for a large number of different situations with different input factors, and RTP and TC were always the best options.
5. Conclusions
A robust and flexible DSS for the study, testing and selection of the most adequate electricity contract has been presented. This DSS uses as inputs the most relevant load shape indexes and
the contracted power of a consumer, and provides as outputs the consumer's load profile, an economic comparative analysis of different contracts and, finally, the decision about the most adequate contract. The knowledge base of the DSS is the result of a CCF based on data mining techniques developed to extract useful knowledge from large amounts of consumer data. Its database is composed of the different classes' load profiles and the contracts being tested. The DSS was tested and validated with a real data case from the Portuguese distribution company. The results obtained after running a large number of simulations are very satisfactory. This DSS is useful both for retail companies and for electricity consumers, because it helps to define the contract that best fits the client's load profile.
Acknowledgements
References
[1] D. Kirschen, Demand-Side View of Electricity Markets, IEEE Transactions on Power Systems, vol. 18, no. 2, pp. 520-526, May 2003.
[2] B. Pitt and D. Kirschen, Application of Data Mining Techniques to Load Profiling, IEEE Transactions on Power Systems, May 1999.
[3] C. Chen, J.C. Hwang and C.W. Huang, Application of Load Survey to Proper Tariff Design, IEEE Transactions on Power Systems, vol. 12, no. 4, pp. 1746-1751, November 1997.
[4] G. Chicco, R. Napoli, P. Postolache, M. Scutariu and C. Toader, Customer Characterization Options for Improving the Tariff Offer, IEEE Transactions on Power Systems, vol. 18, no. 1, pp. 381-387, February 2003.
[5] V. Figueiredo, F. Rodrigues, Z. Vale and B. Gouveia, An Electric Energy Characterization Framework based on Data Mining Techniques, IEEE Transactions on Power Systems, vol. 20, no. 2, pp. 596-602, May 2005.
[6] F. Rodrigues, V. Figueiredo, J. Duarte and Z. Vale, A Comparative Analysis of Clustering Algorithms Applied to Load Profiling, Lecture Notes in Artificial Intelligence (LNAI 2734), pp. 73-85, Springer-Verlag, 2003.
[7] V. Figueiredo, J. Duarte, F. Rodrigues, Z. Vale and B. Gouveia, Electric Energy Customer Characterization by Clustering, Proceedings of ISAP 2003, Lemnos, Greece.
[8] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy, From Data Mining to Knowledge Discovery: An Overview, in Advances in Knowledge Discovery and Data Mining, pp. 1-34, AAAI/MIT Press, 1996.
[9] T. Kohonen, Self-Organisation and Associative Memory, 3rd Ed., Springer-Verlag, 1989.
[10] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[11] M. Ernoult and F. Meslier, Analysis and Forecast of Electrical Energy Demand, Revue Générale d'Electricité, vol. 4, pp. 381-387, 1982.
[12] I. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, 2002.
[13] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[14] E. Turban and J. Aronson, Decision Support Systems and Intelligent Systems, Prentice Hall, 1998.
110 Data Mining for Business Applications
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-110

Mining Medical Administrative Data – The PKB Suite
A. Ceglar et al.
Abstract.
Hospitals are adept at capturing large volumes of highly multi-dimensional data
about their activities including clinical, demographic, administrative, financial and,
increasingly, outcome data (such as adverse events). Managing and understanding
this data is difficult as hospitals typically do not have the staff and/or the expertise
to assemble, query, analyse and report on the potential knowledge contained within
such data. The Power Knowledge Builder (PKB) project investigated the adaptation
of data mining algorithms to the domain of patient costing, with the aim of helping
practitioners better understand their data and therefore facilitate best practice.
Introduction
Hospitals are driven by the twin constraints of maximising patient care while minimising
the costs of doing so. For public hospitals in particular, the overall budget is generally
fixed and thus the quantity (and quality) of the health care provided is dependent on the
1 We are indebted to the staff at PowerSolutions Pty Ltd with whom this suite was developed as part of the
PSD system (http://www.powerhealthsolutions.com/). The research was funded, in part, by an
AusIndustry START grant.
2 Corresponding Author. School of Computer Science, Engineering and Mathematics, Flinders University,
Adelaide, South Australia 5001. Email: [email protected]
As part of this resource pressure, hospitals are often unable to have teams of analysts
looking across all their data, searching for useful information such as trends and anoma-
lies. For example, typically the team charged with managing the patient costing system,
which incorporates a large data repository, is small. These staff may not have a strong
statistical/epidemiological background or the time or tools to undertake complex multi-
dimensional analysis or data mining. Much of their work is in presenting and analysing
a set of standard reports, often related to the financial signals that the hospital responds
to (such as cost, revenue, length of stay or casemix). Even with OLAP tools and report
suites it is difficult for the users to look at more than a small percentage of the available
dimensions (usually related to the known areas of interest) and to undertake some ad hoc
analysis in specific areas, often as a result of a targeted request, e.g. what are the cost
drivers for liver transplants?
Even disregarding the trauma of an adverse patient outcome, adverse events can be
expensive in that they increase the clinical intervention required, resulting in higher-than-
average treatment costs and length-of-stay, and can also result in expensive litigation.
Unfortunately, adverse outcomes are not rare. A study by Wolff et al. [1] focusing on rural
hospitals estimated that .77% of patients experienced an adverse event while another by
Ehsani et al., which included metropolitan hospitals, estimated a figure of 6.88% [2].
The latter study states that the total cost of adverse events “... [represented] 15.7% of the total expenditure on direct hospital costs, or an additional 18.6% of the total inpatient hospital budget”. Given these indicators, it is important that the usefulness of data mining techniques in reducing artefacts such as adverse events is explored.
A seminal example of data mining use within the hospital domain occurred during the Bristol Royal Infirmary inquiry of 2001 [3], in which data mining algorithms were used to create hypotheses regarding the excessive number of deaths among infants who underwent open-heart surgery at the Bristol Royal Infirmary. In a recent speech, Sir Ian Kennedy (who led the original inquiry) said, with respect to improving patient safety, that “The [current] picture is one of pockets of activity but poor overall coordination and limited analysis and dissemination of any lessons. Every month that goes by in which bad, unsafe practice is not identified and rooted out and good practice shared, is a month in which more patients die or are harmed unnecessarily.” The role of data mining within hospital analysis is important given the complexity and scale of the analysis to be undertaken. Data mining can provide solutions that facilitate the benchmarking of patient safety provision, which will help eliminate variations in clinical practice, thus improving patient safety.
The Power Knowledge Builder (PKB) project provides a suite of data mining ca-
pabilities, tailored to this domain. The system aims to alert management to events or
items of interest in a timely manner either through automated exception reporting, or
through explicit exploratory analysis. The initial suite of algorithms (trend analysis, re-
source analysis, outlier detection and clustering) was selected as forming the core set of tools that could be used to perform data mining in a way that would be usable by educated users, but without the requirement for sophisticated statistical knowledge.
To our knowledge, PKB’s goal is unique – it is industry specific and does not require
specialised data mining skills, but aims to leverage the data and skills that hospitals al-
ready have in place. There are other current data mining solutions, but they are typically
part of a more generic reporting solutions (i.e. Business Objects, Cognos) or sub-sets of
data management suites such as SAS or SQL server. These tools are frequently powerful
and flexible, but are not targeted to an industry, and to use them effectively requires a
greater understanding of statistics and data mining methods than our target market gen-
erally has available. This paper introduces the PKB suite and its components in Section
1. Section 2 discusses some of the important lessons learnt, while Section 3 presents the
current state of the project and the way forward.
The PKB suite is a core set of data mining tools that have been adapted to the patient
costing domain. The initial algorithm set (anomaly detection, trend analysis and resource
analysis) was derived through discussion with practitioners, focusing upon potential ap-
plication and functional variation. Subsequently, clustering and characterisation algo-
rithms were appended to enhance usefulness. The current application prototype is writ-
ten in Java 1.4 for compatibility purposes and is multi-threaded to allow for multiple
concurrent analysis instances. The tool, an example of which is presented in Figure 1, is
feature rich, providing algorithms for a number of data mining tasks including clustering,
characterisation and outlier detection.
Each algorithmic component has an interface wrapper, which is subsequently incor-
porated within the PKB prototype. The interface wrapper provides effective textual and
graphical elements, with respect to the pre-processing, analysis and presentation stages, that simplify both the use of PKB components and the interpretation of their results. This is
important as the intended users are hospital administrators, not data mining practitioners
and hence the tools must be usable by educated users, without requiring sophisticated
statistical knowledge.
Outlier (or anomaly) detection is a mature field of research with its origins in statistics
[4]. Current techniques typically incorporate an explicit distance metric, which deter-
mines the degree to which an object is classified as an outlier. A more contemporary
approach incorporates an implied distance metric, which alleviates the need for the pair-
wise comparison of objects [5,6] by using domain space quantisation to enable distance
comparisons to be made at a higher level of abstraction and, as a result, obviates the need
to recall raw data for comparison.
The PKB outlier detection algorithm, CURIO, contributes to the state of the art in outlier detection through novel quantisation and object allocation, which enable the discovery of outliers in large disk-resident datasets in two sequential scans [7]. Furthermore, CURIO addresses a need realised during this project: to discover not only individual outliers but also outlier clusters. By clustering similar (close) outliers and presenting cluster characteristics it becomes easier for users to understand the common traits of similar outliers, assisting the identification of outlier causality. An outlier analysis instance is presented in Figure 2, showing the interactive scatterplot matrix and cluster summarisation tables.
[Figure 3 and Figure 4: two-dimensional quantised grids with outlier cells labelled A–G (see the descriptions in the text below).]
Outlier detection has the potential to find anomalous information that is otherwise
lost in the noise of multiple variables. Hospitals are used to finding (and in fact expect
to see) outliers in terms of cost of care, length of stay etc. for a given patient cohort.
What they are not so used to finding are outliers over more than two dimensions, which
can provide new insights into the hospital activities. The outlier component presents pre-
processing and result interfaces, incorporating effective interactive visualisations that
enable the user to explore the result set, and see common traits of outlier clusters through
characterisation.
Given CURIO's cluster-based foundations, clustering (a secondary component) is a variation of CURIO that finds the common clusters rather than the anomalous clusters. Given their common basis, both the outlier and clustering components require the same parameters and use the same type of result presentation. Although proposed as an area of further work by Knorr [8], the realisation of this functionality is novel and enhances the utility of the CURIO algorithm [7].
Given the simple premise that outliers are distant from the majority of objects when represented in Euclidean space, if this k-dimensional space is quantised, outliers are those objects in relatively sparse cells, where the degree of relative sparsity is dictated by some tolerance T. Given T = 4, Figure 3 presents a 2-dimensional grid illustrating both potential (grey) and valid (white labelled) outlier objects. However, this simple approach can validate false positives, as indicated by A, which is on the edge of a dense region. This problem can be resolved either by creating multiple offset grids or by undertaking a NN (Nearest Neighbour) search. The multiple offset grids approach requires the instantiation of many grids which are slightly offset from each other; this effectively alters the cell allocation of objects and requires a subsequent voting system to determine whether an object is to be regarded as an outlier. The alternative NN search explores the bounding cells of each identified potential outlier cell and, if the number of objects within this neighbourhood exceeds T, all objects residing within the cell are eliminated from consideration. Both techniques were investigated and the neighbourhood search was found to be more accurate, and hence is the one presented.
This overarching theory provides the foundation of CURIO, enabling disk resident
datasets to be analysed in two sequential scans. The quantisation and subsequent count
based validation effectively discovers outliers, indicating that an explicit distance threshold is not required and in fact often needlessly complicates the discovery process. CURIO incorporates an implied distance metric through cellular granularity: the finer the granularity, the shorter the implied distance threshold. The precision parameter P effectively quantises each dimension into 2^P equal-length intervals. For example, Figure 4
illustrates the effect of increasing P by 1 (effectively doubling the number of partitions),
resulting in potentially more outliers.
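A minimal in-memory sketch of the quantisation idea described in this section (not the authors' CURIO implementation, which streams disk-resident data in two sequential scans): cells with fewer than T objects are flagged, and a bounding-cell neighbourhood check removes false positives such as point A in Figure 3.

```python
# Sketch of grid-based outlier detection in the spirit of CURIO: quantise each
# dimension into 2**P cells, flag objects in cells with fewer than T members,
# then keep only those whose bounding-cell neighbourhood also holds fewer than T.
import itertools
import numpy as np

def grid_outliers(X, P=4, T=4):
    lo, hi = X.min(axis=0), X.max(axis=0)
    cells = np.floor((X - lo) / (hi - lo + 1e-12) * (2 ** P)).astype(int)
    cells = np.clip(cells, 0, 2 ** P - 1)
    counts = {}
    for c in map(tuple, cells):                           # first pass: cell counts
        counts[c] = counts.get(c, 0) + 1
    offsets = list(itertools.product((-1, 0, 1), repeat=X.shape[1]))
    outliers = []
    for i, c in enumerate(map(tuple, cells)):             # second pass: validation
        if counts[c] >= T:
            continue
        neighbourhood = sum(counts.get(tuple(np.add(c, o)), 0) for o in offsets)
        if neighbourhood < T:
            outliers.append(i)
    return outliers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 2)), [[6, 6], [7, -5], [-6, 5]]])   # 3 planted outliers
print(grid_outliers(X, P=4, T=4))
```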
1.2. Characterisation
Characterisation allows users to understand more deeply the nature of their data. For
example, a cluster of high-cost cases may be found on a certain day and characterisation
allows users to investigate why this phenomenon occurs – perhaps there is some latency
associated with weekend admissions.
Characterisation (also a secondary component) was initially developed as a sub-
sidiary for outlier and clustering analysis in order to present descriptive summaries of
the clusters to the users. However, it is also available as an independent tool within the suite. The characterisation algorithm provides this descriptive cluster summary by finding the sets of commonly co-occurring attribute values within the set of cluster objects. To achieve this, a partial inferencing engine, similar to those used in association mining [9], is used. The engine uses the extent of an attribute value's (element's) occurrence within the dataset to determine its significance and, subsequently, its usefulness for summarisation purposes. Once the valid elements have been identified, the algorithm deepens, finding progressively larger, frequently co-occurring element sets within the dataset.
Given the target of presenting summarised information about a cluster, the valid ele-
ments are those that occur often within the source dataset. While this works well for non-ordinal (and other low-cardinality) data, ordinal data requires partitioning into ranges so that a significant mass can be achieved. This is accomplished by progressively reducing the number of partitions until at least one achieves a significant volume. Given the range 1 to 100, an initial set of 2^6 partitions is formed; if no partition is valid, each pair of partitions is merged by removing the least significant bit (2^5 partitions). This
process continues until a significant mass is reached. This functionality is illustrated in
Figure 2 through the presentation of a summarisation table with ordinal-ranges.
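A simplified sketch of the characterisation idea on invented data: frequent attribute=value elements are found within a cluster, frequent co-occurring pairs are derived from them, and numeric attributes are re-binned by halving the number of partitions until one range reaches a significant mass. The support threshold and the restriction to pairs (rather than progressively larger sets) are simplifying assumptions.

```python
# Sketch of cluster characterisation: frequent attribute=value "elements" within a
# cluster, frequent co-occurring pairs, and ordinal re-binning by halving partitions.
from itertools import combinations
import numpy as np
import pandas as pd

def coarsen_until_frequent(values, min_support, start_bits=6):
    """Re-bin a numeric column into 2**bits equal-width ranges, halving until
    one range holds at least min_support of the records."""
    for bits in range(start_bits, 0, -1):
        binned = pd.cut(values, bins=2 ** bits)
        if (binned.value_counts(normalize=True) >= min_support).any():
            return binned.astype(str)
    return pd.Series(["all"] * len(values), index=values.index)

def characterise(cluster_df, min_support=0.4):
    df = cluster_df.copy()
    for col in df.select_dtypes("number").columns:
        df[col] = coarsen_until_frequent(df[col], min_support)
    elements = []                                          # frequent attribute=value items
    for col in df.columns:
        freq = df[col].value_counts(normalize=True)
        elements += [(col, v) for v, f in freq.items() if f >= min_support]
    pairs = []                                             # frequent co-occurring pairs
    for (c1, v1), (c2, v2) in combinations(elements, 2):
        if c1 != c2:
            support = ((df[c1] == v1) & (df[c2] == v2)).mean()
            if support >= min_support:
                pairs.append(((c1, v1), (c2, v2), round(support, 2)))
    return elements, pairs

rng = np.random.default_rng(0)
cluster = pd.DataFrame({"day": rng.choice(["Sat", "Sun"], 200, p=[0.7, 0.3]),
                        "ward": rng.choice(["ICU", "GEN"], 200, p=[0.8, 0.2]),
                        "cost": rng.normal(9000, 500, 200)})   # hypothetical high-cost cluster
print(characterise(cluster))
```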
Resource usage analysis is a domain-specific tool that analyses patterns of resource use for patient episodes (hospital stays). This novel algorithm is backed by an extended inferencing engine [9] that provides dynamic numeric range partitioning, temporal semantics and affiliated attribute quantisation, yielding a rich analysis tool. Furthermore, the tool enables the association of extraneous variables, such as average cost and frequency, with the resource patterns. The resource usage results
are presented as a set of sortable tables, where each table relates to a specified dataset
partition. For example, the user can specify the derivation of daily resource usage patterns
for all customers with a particular Diagnosis Related Group, partitioned by consulting
doctor. By associating average cost and frequency with these patterns, useful information
one patient took place on day one, while for another patient in the same cohort it took
place on day two? can also be addressed. The resource analysis component automates
and simplifies what would have previously been very complex tasks for costing analysts
to perform.
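The kind of output described here can be approximated with a straightforward grouping, sketched below on invented episode data: daily resource-use patterns per DRG, partitioned by consulting doctor, with pattern frequency and average cost attached. The real component's dynamic numeric partitioning, temporal semantics and affiliated attribute quantisation are not reproduced.

```python
# Sketch of the kind of summary the resource analysis component produces: for each
# DRG and consulting doctor, the day-by-day resource patterns of patient episodes
# together with their frequency and average cost. Column names and data are invented.
import pandas as pd

episodes = pd.DataFrame({
    "episode": [1, 1, 1, 2, 2, 3, 3, 3],
    "drg":     ["G70"] * 8,
    "doctor":  ["Dr A", "Dr A", "Dr A", "Dr A", "Dr A", "Dr B", "Dr B", "Dr B"],
    "day":     [1, 1, 2, 1, 2, 1, 2, 2],
    "resource": ["theatre", "pharmacy", "ward", "theatre", "ward", "theatre", "ward", "pharmacy"],
    "cost":    [4200, 150, 800, 4100, 820, 3900, 780, 160],
})

# One row per episode: its day-by-day resource pattern and its total cost.
patterns = (episodes.sort_values(["day", "resource"])
            .groupby(["drg", "doctor", "episode"])
            .agg(pattern=("resource", lambda r: " > ".join(r)),
                 cost=("cost", "sum"))
            .reset_index())

# Partitioned tables: pattern frequency and average cost per DRG and doctor.
summary = (patterns.groupby(["drg", "doctor", "pattern"])
           .agg(frequency=("cost", "size"), avg_cost=("cost", "mean"))
           .sort_values("frequency", ascending=False))
print(summary)
```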
2. Lessons Learned
The PKB project began as a general investigation into the application of data mining techniques to patient costing software, between Flinders University and PowerHealth Solutions, providing both an academic and an industry perspective. Now, 18 months on from its inception, many lessons have been learnt that will hopefully aid both parties in future interactions with each other and with other partners. From an academic viewpoint, issues relating to the establishment of beta test sites and the bullet-proofing of code are unusual, while from an industry viewpoint the meandering nature of research and the potential for somewhat tangential results can be frustrating. Overall, three main lessons have been learnt.
Solution looking for a problem? It is clear that understanding data and deriving usable
information and insights from it is a problem in hospitals, but how best to use the
research and tools is not always clear. In particular, the initial project specification
was unclear as to how this would be achieved. As the project evolves it is crys-
tallising into a tool suite that complements PowerHealth Solutions' current report-
ing solution. More focus upon the application of the PKB suite from the outset
would have sped up research but may also have constrained the solutions found.
Educating practitioners. The practical barriers to data mining reside more in the struc-
turing and understanding of the source data than in the algorithms themselves. A
significant difficulty in providing data mining capabilities to non-experts is the re-
quirement for the users to be able to collect and format source data into a usable
format. Given knowledge of the source data, scripts can easily be established to ac-
complish the collection process. However, where the user requires novel analysis, an understanding of the necessary source data is needed. It is possible to abstract the algorithmic issues away from the user, providing user-friendly GUIs for result interpretation and parameter specification; however, this is difficult to achieve for source data specification, as the user must have a level of understanding of the nature of the required analysis in order to adequately specify it.
Pragmatics. The evaluation of the developed tools requires considerable analysis, from
both in-house analysts and analysts from third parties who have an interest in the
PKB project. The suite is theoretically of benefit, with many envisaged scenarios
(based upon experience) where it can deliver useful results, but it is difficult to find
beta sites with available resources.
The second version of the PKB suite is now at beta-test stage, with validation and further
functional refinement required from industry partners. The suite currently consists of a
set of fast algorithms with relevant interfaces that do not require special knowledge to
use. Of importance in this stage is feedback regarding the collection and pre-processing
stages of analysis and how the suite can be further refined to facilitate practitioners in
undertaking this.
The economic benefits of the suite are yet to be quantified. Expected areas of benefit
are in the domain of quality of care and resource management. Focusing upon critical
indicators, such as death rates and morbidity codes, in combination with multiple other
dimensions (e.g. location, carer, casemix and demographic dimensions) has the potential
to identify unrealised quality issues.
Three immediate areas of further work are evident: the inclusion of extraneous repositories, knowledge base construction and textual data mining. The incorporation of extraneous repositories, such as meteorological and socio-economic data, within some analysis routines can provide useful information regarding causality. The incorporation of an evolving knowledge base will facilitate analysis by either eliminating known information from result sets or flagging critical artefacts. As most hospital data is not structured, but contained in notes, descriptions and narrative, the mining of textual information will also be valuable.
References
[1] Wolff, A.M., Bourke, J., Campbell, I.A., Leembruggen, D.W.: Detecting and reducing hospital adverse events: outcomes of the Wimmera clinical risk management program. Medical Journal of Australia 174 (2001) 621–625
[2] Ehsani, J.P., Jackson, T., Duckett, S.J.: The incidence and cost of adverse events in Victorian hospitals 2003–04. Medical Journal of Australia 184 (2006) 551–555
[3] Kennedy, I.: Learning from Bristol: The report of the public inquiry into children’s heart surgery at the
Bristol Royal Infirmary 1984-1995. Final report, COI Communications (2001)
[4] Markou, M., Singh, S.: Novelty detection: a review - part 1: statistical approaches. Signal Processing
83 (2003) 2481–2497
[5] Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large datasets. In Gupta, A.,
Shmueli, O., Widom, J., eds.: 24th International Conference on Very Large Data Bases, VLDB’98, New
York, NY, USA, Morgan Kaufmann (1998) 392–403
[6] Papadimitriou, S., Kitagawa, H., Gibbons, P., Faloutsos, C.: LOCI: Fast outlier detection using the local
correlation integral. In: 19th International Conference on Data Engineering (ICDE), Bangalore (2003)
315–326
[7] Ceglar, A., Roddick, J.F., Powers, D.M.: CURIO: A fast outlier clustering algorithm for large datasets.
In Ong, K.L., Li, W., Gao, J., eds.: Second International Workshop on Integrating AI and Data Mining
(AIDM 2007). Volume 84 of CRPIT., Gold Coast, Australia, ACS (2007) 37–45
[8] Knorr, E.: Outliers and Data Mining: Finding Exceptions in Data. PhD thesis, University of British
Columbia (2002)
[9] Ceglar, A., Roddick, J.F.: Association mining. ACM Computing Surveys (2006)
[10] Brockwell, P.J., Davis, R.A.: Time Series: Theory and Methods. Springer, NY (1987)
[11] Keogh, E., Chakrabarti, K., Mehrotra, S., Pazzani, M.: Locally adaptive dimensionality reduction for indexing large time series databases. In: ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA (2001) 151–162
Part 3
Data Mining Applications of Tomorrow
Data Mining for Business Applications 123
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-123

Clustering of Adolescent Criminal Offenders
M. Breitenbach et al. 1

1 Corresponding Author: Department of Computer Science, University of Colorado at Boulder, 430
Introduction
A common analytical sequence in the data mining process is to first identify clusters in the data,
assess robustness, interpret them and later train a classifier to assign new cases to the
respective clusters.
The present study applies several unsupervised clustering techniques to a highly disputed area in criminology, i.e. the existence of criminal offender types. Many contemporary criminologists argue against the possibility of separate criminal types [5] while others strongly support their existence (see [2,6]). Relatively few studies in criminology
have used Data Mining techniques to identify patterns from data and to examine the ex-
istence of criminal types. To date, the available studies have typically used inadequate
cross verification techniques, small and inadequate samples and have produced incon-
sistent or incomplete findings, so that it is often difficult to reconcile the results across
these studies. Often, the claims for the existence of “criminal types” have emerged from
psychological or social theories that mostly lack empirical verification. Several attempts
have been made to integrate findings from available classification studies. While these
efforts have suggested some potential replications of certain offender types, they have been limited by their failure to provide clear classification rules. For example, a psychopathic category has emerged from a large clinical literature, but there is much dispute over how to
identify them, what specific social and psychological causal factors are critical, whether
or not this type exists among female offenders or among adolescents and whether there
are “sub-types” of psychopaths. Thus, a current major challenge in criminology is to ad-
dress whether reliable patterns or types of criminal offenders can be identified using data
mining techniques and whether these may replicate the criminal profiles as described in
the prior criminological literature.
In the present study, juvenile offenders (N = 1572) from three U.S. state juvenile justice systems were assessed on a battery of criminogenic risk and needs factors as well as official criminal histories. Data mining techniques were applied with the goal of identifying intrinsic patterns in this data set and of assessing whether these replicate any of the main patterns previously proposed in the criminological literature [6]. The present study thus aimed to identify empirical patterns within this juvenile justice population and to examine how they relate to certain theorized patterns from the prior criminological literature. The implications of these findings for criminology are manifold. Firstly, the findings suggest that certain offender patterns can be reliably identified using several unsupervised clustering techniques from data mining. Secondly, the findings appear to offer a challenge to those criminological theorists who hold that there is only one general “global explanation” of criminality as opposed to multiple pathways with different explanatory models (see [7]).
From a methodological perspective the present paper illustrates some of the diffi-
cult analytical problems encountered in applied criminological research that mainly stem
from the kind of data encountered in this field. A first major problem is that the data is
noisy and often unreliable. Second, the empirical clusters are not clear cut so that cases
range from strongly classified to poorly classified boundary cases with only weak cluster
affiliations. Certain cases may best be seen as hybrids (close to cluster boundaries) or
outliers. Additionally, some distortion of clusters can be a problem since many clustering algorithms assign a label to every point in the data, including outliers. Such “forcing” of ill-fitting members (both hybrid and outlier cases) may distort the quality and interpretation of the clustering results. Standard methods such as K-Means will assign cases to the closest cluster center no matter how “far away” from the cluster centers the points are. While other algorithms such as EM clustering [8] output probabilities of class membership, the elimination of outliers in an unsupervised setting is a hard problem in this area of applied research. In this context we acknowledge that much work has been done to make clustering more robust against outliers, such as using clustering ensembles [9,10] or combining the results of different clustering methods [11], but a clear challenge is to develop effective methods to eliminate cases aggressively in order to obtain refined clustering solutions, i.e. to remove points that are not “close enough” to the cluster center.
Thus, in this research we demonstrate a methodology to identify well-clustered cases. Specifically, we combined a semi-supervised technique with an initial standard clustering solution. This process identified several highly replicated offender types with clear definitions of reliable and core criminal patterns. Of substantive criminological interest, we found that these replicated clusters provided social and psychological profiles
that have a strong resemblance to certain of the criminal types proposed in prior litera-
ture by leading criminologists [6,2]. However, the present findings go beyond these prior
typological proposals firstly by grounding the type descriptions in clearly defined em-
pirical patterns. Secondly, they allow explicit classification rules for each of the offender types, which have been generally absent from the prior criminological literature.
1. Method
Juvenile offenders (N = 1572) from three state systems were assessed on a battery of
criminogenic risk and needs factors using the Youth COMPAS assessment instrument
described in [12], together with their official criminal histories. The scales assess various areas of risk and desistance such as the relationship with the youth's family, school, substance abuse, aggressive behaviors, abuse, socio-economic situation and social factors.
We started with a provisional initial solution obtained “manually” using standard
K-means and Ward’s minimum-variance method. These approaches have been the pre-
ferred choice in many social and psychological studies to find hidden or latent typologi-
cal structure in data [13,14].
Despite its success, standard K-means is vulnerable to data that do not conform to the minimum-variance assumption or that expose a manifold structure, that is, regions (clusters)
that may wind or straggle across a high-dimensional space. The initial K-means clusters
were also vulnerable to remaining outliers or noise in the data. Thus, we proceeded with
two additional methods designed to deal more effectively with these outlier and noise
problems.
Bagging has been used with success for many classification and regression tasks [15]. In
the context of clustering, bagging generates multiple classification models from bootstrap
replicates from the selected training set and then integrates these into one final aggregated
model. By using only two-thirds of the training set (with some cases repeated) to create
each model, we aimed to achieve models that are fairly uncorrelated, so that the final aggregated model may be more robust to noise or any remaining outliers inherent in the data.
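A hedged sketch of bagged K-Means: K-Means is fitted to bootstrap replicates drawn from two-thirds of the data, and the resulting centres are pooled and re-clustered into one aggregated set. The pooling step is one common aggregation strategy; the aggregation used in the chapter's reference [16] may differ.

```python
# Sketch of bagged K-Means: fit K-Means on bootstrap replicates (two-thirds of the
# data, with repetition), pool the resulting centres, and cluster the pooled centres
# to obtain one aggregated set of centres.
import numpy as np
from sklearn.cluster import KMeans

def bagged_kmeans(X, k, n_models=25, frac=2/3, seed=0):
    rng = np.random.default_rng(seed)
    centres = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=True)   # bootstrap replicate
        centres.append(KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx]).cluster_centers_)
    pooled = np.vstack(centres)
    final = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pooled)   # aggregate the centres
    return final.cluster_centers_

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.5, (200, 2)) for m in ((0, 0), (4, 0), (0, 4))])
centres = bagged_kmeans(X, k=3)
labels = np.argmin(np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2), axis=1)
print(centres.round(2), np.bincount(labels))
```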
We will now briefly summarize the semi-supervised labeling method proposed in [4].
Given a set of points $X \in \mathbb{R}^{n \times m}$ and labels $L = \{1, \dots, c\}$, let $x_i$ denote the $i$th example. Without loss of generality the first $l$ points ($1, \dots, l$) are labeled and the remaining
points ($l+1, \dots, n$) are unlabeled. Define $Y \in \mathbb{N}^{n \times c}$ with $Y_{ij} = 1$ if point $x_i$ has label $j$ and $0$ otherwise. Let $\mathcal{F} \subset \mathbb{R}^{n \times c}$ denote the set of all matrices with nonnegative entries. A matrix $F \in \mathcal{F}$ labels each point $x_i$ with the label $y_i = \arg\max_{j \le c} F_{ij}$. Define the series $F(t+1) = \alpha S F(t) + (1-\alpha) Y$ with $F(0) = Y$ and $\alpha \in (0,1)$. The entire algorithm is defined as follows:
1. Form the affinity matrix $W_{ij} = \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$ for $i \neq j$ and $W_{ii} = 0$; $\sigma$ determines how fast the distance function decays.
2. Compute $S = D^{-1/2} W D^{-1/2}$, where $D$ is the diagonal matrix with $D_{ii} = \sum_{j=1}^{n} W_{ij}$ and $D_{ij} = 0$ for $i \neq j$.
3. Compute the limit of the series, $\lim_{t \to \infty} F(t) = F^{*} = (I - \alpha S)^{-1} Y$; $\alpha \in (0,1)$ limits how much the information spreads from one point to another.
4. Label each point $x_i$ as $\arg\max_{j \le c} F^{*}_{ij}$.
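The four steps above translate almost directly into NumPy; the sketch below uses the closed-form limit from step 3 rather than iterating the series, on toy data with arbitrary illustrative choices of σ and α.

```python
# Direct NumPy transcription of the four steps of the consistency method, using the
# closed-form limit F* = (I - alpha*S)^(-1) Y instead of iterating the series.
import numpy as np

def consistency_labels(X, seed_idx, seed_labels, n_classes, sigma=0.5, alpha=0.99):
    n = len(X)
    # Step 1: affinity matrix with zero diagonal.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: symmetric normalisation S = D^(-1/2) W D^(-1/2).
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    # Step 3: closed-form limit of the series, seeded with the labeled points.
    Y = np.zeros((n, n_classes))
    Y[seed_idx, seed_labels] = 1.0
    F = np.linalg.solve(np.eye(n) - alpha * S, Y)
    # Step 4: label each point by the largest entry of its row of F*.
    return F.argmax(axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0, 0), 0.3, (50, 2)), rng.normal((3, 3), 0.3, (50, 2))])
labels = consistency_labels(X, seed_idx=[0, 50], seed_labels=[0, 1], n_classes=2)
print(np.bincount(labels[:50]), np.bincount(labels[50:]))   # each half should get one label
```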
The regularization framework for this method is as follows. The cost function associated with the matrix $F$, with regularization parameter $\mu > 0$, is defined as

$$Q(F) = \frac{1}{2}\left( \sum_{i,j=1}^{n} W_{ij} \left\| \frac{1}{\sqrt{D_{ii}}} F_i - \frac{1}{\sqrt{D_{jj}}} F_j \right\|^2 + \mu \sum_{i=1}^{n} \| F_i - Y_i \|^2 \right) \qquad (1)$$

The first term is the smoothness constraint, which associates a cost with change between nearby points. The second term, weighted by $\mu$, is the fitting constraint, which associates a cost with change from the initial assignments. The classifying function is defined as $F^{*} = \arg\min_{F \in \mathcal{F}} Q(F)$. Differentiating $Q(F)$ one obtains $F^{*} - \frac{1}{1+\mu} S F^{*} - \frac{\mu}{1+\mu} Y = 0$. Defining $\alpha = \frac{1}{1+\mu}$ and $\beta = \frac{\mu}{1+\mu}$ (note that $\alpha + \beta = 1$ and the matrix $(I - \alpha S)$ is non-singular), one obtains

$$F^{*} = \beta (I - \alpha S)^{-1} Y \qquad (2)$$
For a more in-depth discussion of the regularization framework and of how to obtain (2), we refer the reader to [4].
To tackle the problem of hybrid case elimination we use a voting methodology that eliminates cases on which different algorithms disagree, similar to [11], which combines hierarchical and partitioning clusterings.
In this paper we adopted the following solution: First, we use Bagged K-Means
[16] to get a stable estimate of our cluster centers in the presence of outliers and hybrid
cases. To eliminate cases that are far away from the cluster centers, we use the centers
[Figure 1 panels: Consistency Iteration t = 1, 5, 10, 20, 50, 150.]
Figure 1. Consistency Method: two labeled points per class (big stars) are used to label the remaining unla-
beled points with respect to the underlying cluster structure. F ∗ denotes the convergence of the series.
[Figure 2 panels: K-Means labels (top left), Consistency Method labels for two values of σ (top right, bottom left), and the consensus with the hybrid cases removed (bottom right).]
Figure 2. Toy example: Three Gaussians with hybrid cases in between them. Combining the labels assigned
by K-Means (top, left) and the Consistency Method (top, right; bottom, left) with two different σ results in the
removal of most of the hybrid cases (bottom, right) by requiring consensus between all models built.
in a semi-supervised setting with the consistency method [4] to obtain a second set of
labels. These labels from the semi-supervised method are obtained with a completely
different similarity measure than the K-Means labels. K-Means assigns labels by using
the distance to the cluster center (Nearest Neighbor) and works best given clusters that
are Gaussian. The semi-supervised consistency method assigns labels with respect to
the underlying intrinsic structure of the data and follows the shape of the cluster. The
semi-supervised labeling method minimizes (1) while K-Means attempts to minimize
the intra-cluster variance for each cluster $C_i$ with respective cluster mean $\mu_i$, i.e.

$$V = \sum_{i=1}^{k} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2 \qquad (3)$$
These two fundamentally different methods of label assignments are more likely to dis-
agree the farther away the point is from the cluster center. We eliminate cases in which
the labels do not agree. Note that the consistency method has been demonstrated to work
well on high-dimensional data such as images. On the other hand it has been demon-
strated that assignments of labels using Nearest Neighbor in high dimensional spaces are
often unusable [18].
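A condensed sketch of this consensus step: ordinary K-Means provides one labelling, the consistency method, seeded with the points nearest each K-Means centre, provides the second, and cases on which the two disagree are dropped. Bagging and the use of two different σ values are omitted here for brevity.

```python
# Sketch of the consensus step: K-Means labels versus consistency-method labels
# seeded at the K-Means centres; points where the two disagree are treated as
# hybrids/outliers and dropped. Plain K-Means stands in for the bagged variant.
import numpy as np
from sklearn.cluster import KMeans

def consistency_labels(X, seed_idx, seed_labels, n_classes, sigma=0.5, alpha=0.99):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq / (2 * sigma ** 2)); np.fill_diagonal(W, 0.0)
    d = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d[:, None] * d[None, :]
    Y = np.zeros((len(X), n_classes)); Y[seed_idx, seed_labels] = 1.0
    return np.linalg.solve(np.eye(len(X)) - alpha * S, Y).argmax(axis=1)

rng = np.random.default_rng(0)
clusters = [rng.normal(m, 0.3, (60, 2)) for m in ((0, 0), (3, 0), (1.5, 2.5))]
hybrids = rng.uniform((0.8, -0.3), (2.2, 0.3), (15, 2))          # points between clusters
X = np.vstack(clusters + [hybrids])

k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
# Seed the consistency method with the points closest to each K-Means centre,
# so both methods use the same label identities.
seed_idx = [int(np.argmin(np.linalg.norm(X - c, axis=1))) for c in km.cluster_centers_]
cons = consistency_labels(X, seed_idx, list(range(k)), k)
keep = km.labels_ == cons                                          # consensus voting
print("kept", keep.sum(), "of", len(X), "cases; dropped indices:", np.where(~keep)[0])
```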
The process is illustrated in Figure 2 with a toy example consisting of three Gaus-
sians and a couple of hybrid cases placed in between. For the purposes of this discussion
we labeled the five different groups of points. The three clusters are labeled as 1, 3 and
5. The hybrid cases are labeled as 2 and 4. We can see that the labeling resulting from
K-Means (upper left plot) and the consistency method differ (upper right, lower left).
The final voting solution (lower right) identifies hybrid cases that can then be removed 2.
Using the method outlined above results in roughly half the cases in our data being eliminated. The stability of these central core cases, as retained in the consensus model, is shown by the almost identical matching of these core cases between the consensus model and the bagged K-means solution (κ = .992, η = .994) and also to the original
2. Results
2A color version of this figure for easier viewing is available at the following URL:
http://markus-breitenbach.com/figures/ecml_fig2.jpg
[Figure 3: mean plots across the seven core classes (1: n = 83, 2: n = 103, 3: n = 85, 4: n = 151, 5: n = 197, 6: n = 146, 7: n = 130) for the inputs FamCrime, SubTrbl, Impulsiv, ParentConf, ComDrug, Aggress, ViolTol, PhysAbuse, PoorSuper, Neglect, AttProbs, SchoolBeh, InconDiscp, EmotSupp, CrimAssoc, YouthRebel, LowSES, FamilyDisc, NegCognit, Manipulate, HardDrug, SocIsolate, CrimOpp, EmotBonds, LowEmpath, Nhood, LowRemor, Promiscty, SexAbuse, LowProsoc, AcadFail, LowGoals.]
Figure 3. Resulting Cluster Means: Mean Plots of External Criminal History Measures Across Classes from
the Core Consensus Solution with Bootstrapped 95% Confidence Limits.
Cluster 1. Internalizing Youth A: Withdrawn, Abused and Rejected. This cluster is dominated by extreme family abuse and an internalizing pattern of social with-
drawal, hostility and suspicion. These youths come from very poor families (LowSES)
that are highly disorganized (FamilyDiscontinuity) and have a history of high crime/drug
dominance, low remorse, criminal peers, high risk lifestyle, drug abuse and serious criminal history. This type has little evidence of sexual or physical abuse and is in rebellion against their parents. School performance is relatively poor (AcadFail), along with attention problems (AttProbs) and disruptive school behaviors (SchoolBeh), and the youth have few pro-social activities after school (LowProsoc). This cluster’s official criminal history coheres with the above extreme profile. This cluster has the highest mean number of both adjudications and detentions compared to all others.
Cluster 4. Normal “Accidental/Situational” Delinquents. We found two clusters of broadly normal youth (Clusters 4 and 7). Cluster 4 reflects mostly “normal” youth with few risk factors. This benign pattern, together with their late age at first adjudication and mostly minor delinquency, appears to be a good match for the AL (adolescence-limited) type described in [1]. This type scores lower than average on all the scales. Their personality pattern shows no clear tendency towards low self-control.
Cluster 5. Internalizing Youth B: With Positive Parenting. Cluster 5 and Cluster 1 both exhibit the internalizing pattern of social withdrawal, isolation and mistrust. Both also avoid delinquent peers and drugs, have low adjudication rates, and arguably belong in a single large “internalizing” cluster. This cluster matches the “neurotic” offender category in [2]. This internalizing pattern (like Cluster 1) has above average negative social attributions (NegCognit), hostile aggression (Aggress) and social withdrawal (SocIsolate). The social isolation is perhaps linked to a relatively low-risk lifestyle reflected by avoidance of delinquent peers (CrimAssoc), common drugs (ComDrug), hard drugs (HardDrug) and promiscuity. It profoundly differs from Cluster 2 by the presence of caring, competent and non-abusive parents, who are not neglectful and who do not shirk their supervision. These families give little evidence of serious disorganization (FamilyDisc) and have lower than average family crime/drugs (FamCrime) and low parental conflict (ParentConf).
Cluster 6. Low-control B: Early Onset, Versatile Offenders with Multiple Risk Factors. Cluster 6 is a more extreme variant of Cluster 3. This profile appears well matched to
“secondary psychopath” in [2] and “primary sociopath” in [3]. These youth score above
average on every scale. These youth follow a high-risk lifestyle, associate with anti-social peers and have the highest scores for soft drugs, hard drugs, drug-related trouble (SubTrbl) and promiscuity. Their personality shows above average impulsivity, manipulative-dominance and tolerance of violence. At school they show disruptive school behavior and attention problems, but only moderately above-average failure. Their families show patterns of poor supervision and neglect and are in serious conflict with each other (ParentConf), while the youths show extreme rebellion against their parents (YouthRebel).
Cluster 7. Normative Delinquency: Drugs, Sex and Peers. This cluster, along with Cluster 4, also reflects “normal” youth with substantial school and family strengths. However, Cluster 7, unlike Cluster 4, has vulnerabilities to drugs (ComDrug, HardDrug, SubTrbl), sex (Promiscty) and criminal peers (CrimAssoc). Their personality appears benign, with few signs of low self-control or social isolation. Their official record coheres with this profile, showing an older age at first arrest and mostly the normative deviance that is widespread among most youth [24].
complexity of delinquent behavior, the probabilistic nature of most risk factors and the multiplicity of causal factors. Additionally, our findings on boundary conditions and non-classifiable cases must remain provisional, since refinements to our measurement space may reduce boundary problems. Specifically, it is known that the presence of noise and non-discriminating variables can blur category boundaries [13]. Further research may clarify the discriminating power of all classification variables (features) and gradually converge on a reduced space of only the most powerful features.
2.2. Classification
Since we now have a labeled dataset, we examine how well a classifier is able to discriminate between the classes we have identified. We use a linear support vector machine in a one-against-one setting to build classifiers for the seven classes in the data set. We use 10-fold cross-validation and a random 90%/10% split to estimate the error on future unseen data. In Table 1 we can see that a linear SVM can easily discriminate between the classes with more than 90 percent accuracy.
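A minimal sketch of this evaluation, assuming the labeled scale data are in arrays X and y (here replaced by synthetic stand-ins), could look as follows; the exact SVM implementation and settings used by the authors are not specified beyond "linear, one-against-one".

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the 32 COMPAS scales and the 7 cluster labels.
X, y = make_classification(n_samples=1000, n_features=32, n_informative=12,
                           n_classes=7, random_state=0)

clf = SVC(kernel="linear", decision_function_shape="ovo")  # one-against-one

# 10-fold cross-validation estimate of accuracy on unseen data.
cv_accuracy = cross_val_score(clf, X, y, cv=10).mean()

# Alternatively, a single random 90%/10% split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
split_accuracy = accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
print(cv_accuracy, split_accuracy)
```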
Table 1. Cross-validation results with a linear support vector machine on the COMPAS data (93.6387% correct using cross-validation; 92% correct using the 90/10 split estimate).
a b c d e f g ← classified as
133 2 1 0 6 2 0 a=1
3 248 4 3 3 3 2 b=2
0 4 224 0 1 3 3 c=3
0 7 0 248 3 0 6 d=4
4 5 3 1 268 0 7 e=5
1 2 3 0 0 171 0 f=6
0 2 3 4 8 1 180 g=7
2.3. Replication
To verify our findings and ensure that the clusters we have found are not artefacts of our sample, we used a cross-replication and cluster validation design proposed by McIntyre and Blashfield [25,26]. This method requires that one repeat the original analysis on a replication sample (B) using identical methods. Furthermore, the cases of the replication sample are assigned to the clusters of the original sample using a classification procedure. The similarity of the two assignments is then compared in cross-tabulations.
For our purposes we used Support Vector Machine [27] models trained on the labels obtained for the original sample: one model in a 1-vs-all setting and a second model in a 1-vs-1 setting. Both models have equally good classification performance, and in order to avoid erroneous assignments we require that both models agree on the class.
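Sketched in code (with synthetic stand-ins for the original sample A and the replication sample B, since the real data are not public), the agreement rule might look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC, SVC

# Stand-ins for the labeled original sample (A) and the replication sample (B).
X_a, y_a = make_classification(n_samples=1200, n_features=32, n_informative=12,
                               n_classes=7, random_state=1)
X_b, _ = make_classification(n_samples=1453, n_features=32, n_informative=12,
                             n_classes=7, random_state=2)

ovr = OneVsRestClassifier(LinearSVC(max_iter=5000)).fit(X_a, y_a)        # 1-vs-all
ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X_a, y_a)  # 1-vs-1

# A replication case is assigned to an original cluster only if both agree.
pred_ovr, pred_ovo = ovr.predict(X_b), ovo.predict(X_b)
assigned = np.where(pred_ovr == pred_ovo, pred_ovr, -1)   # -1 = unclassified
```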
The replication sample (B) of 1,453 youth was assessed using identical instruments.
This sample consisted of successive admissions to juvenile assessment centers at four
urban judicial districts in a western state that was not included in the training sample.
The sample is 67% male delinquents. The average age is 15.6 years (SD = 1.6) and
ranges from 9.0 years to 18.0 years. The ethnicity breakdown in the sample is 54.2%
Caucasian, 17.5% African American, 23.9% Latino/a, and 4.4% other ethnic groups.
Approximately 70% of these youth had entered the juvenile assessment centers after an arrest for a misdemeanor or felony offense, while the remainder were brought in for other reasons, including status offenses, school referrals, and family issues. Fifty-five
percent of the sample had no adjudications. Sample B contains fewer serious delinquents
than the original sample A.
The overall comparison of the initial cluster solution (B2) and replication solution (B1) produced a strong and significant global relation (Contingency Coefficient = 0.84, p < .0001). Yet, some differences did exist. The specific matching was not always exact and the initial pattern 6 completely failed to replicate. The absence of cluster 6 prevented the computation of a kappa coefficient due to different numbers of classes in B1 and B2. The missing cluster is understandable because the most serious delinquents would be unlikely to be referred to the juvenile assessment centers.
The SVM classification indicated that overall 70% of the replication sample was
assigned to one of the original seven patterns. The most frequent cluster for both boys and
girls was the relatively low-risk/low-delinquency cluster 4. Cluster 6 (the most serious
delinquent profile) was the least frequent, with only 1.6% for boys and 1.9% for girls.
Additionally, all seven of the original clusters were recovered by the SVM, although,
as noted, very few cases in the replication sample matched cluster 6. Overall, 30% of
the replication sample failed to meet the matching criteria of the SVM and remained
unclassified.
2.4. Limitations
The present research has limitations and could be extended in several directions. Our sample, while large and fairly heterogeneous, was limited to two state jurisdictions (Georgia and North Dakota) and one county jurisdiction (Ventura, California). Additionally, our sample did not cover the entire spectrum of juvenile justice agencies but was dominated by committed youth. These sample characteristics limit the generalizability of our findings.
The selection of taxonomic methods is also a difficult issue and several alternative approaches are possible. We utilized only a limited set of potentially appropriate pattern-seeking methods. Nagin and Paternoster [28] acknowledge there is no clear consensus on the most appropriate methods to study population heterogeneity and suggest that researchers should explore different methods with different assumptions. Alternatives include several families of cluster analysis, latent class models, Meehl’s [29] taxometric methods and semi-parametric mixed Poisson models [30,13,28]. We adopted this suggestion by using several classical density-seeking methods and a more recent method embodying different mathematical assumptions to identify pattern structure.
Another methodological limitation is the unresolved challenge of finding an optimal
value of K. [31] list over 30 different approaches to this problem. Ultimately, as in many
recent studies [32], we relied on a combination of methods, as well as interpretative
clarity. The K = 7 solution is tentative, and we acknowledge that the more parsimonious
K = 5 solution (not discussed here) may have advantages.
A perennial difficulty in any taxonomic study is the selection (coverage) and focus of classification factors or classification space. Any specific selection inevitably imposes a limitation on the knowledge claims and inferences that can be made regarding the resulting types, and will inevitably omit other explanatory perspectives. In contrast, several prior studies adopted a broad holistic person-centered strategy, recommended by Magnusson [33], by using comprehensive multivariate coverage of key factors. Our present approach to selecting features was guided by several current theories of delinquency and the extant taxonomic literature, and included a spectrum of family, peer, school, community, cognition and personality domains. A key omission may be our limited coverage of mental health factors. The distinctions between our two internalizing clusters could perhaps gain from a deeper assessment of mental health issues. A current study (in progress) has added mental health assessment to the current measurement space. Preliminary results show that depression and suicide risk, as expected, correlate highly with the social isolation and negative social cognition scales.
A related methodological challenge is that irrelevant or poorly discriminating variables inevitably add noise and may blur boundaries between clusters [13]. This issue was recently reviewed and new methodological approaches to its resolution were offered [34]. In ongoing research we are exploring these and other alternatives for the possible refinement of this classification space.
In conclusion, we agree with both Nagin and Paternoster [28] and Lykken [2] that
we are still at early stages in mapping the taxonomic heterogeneity of delinquency – from
both behavioral and explanatory perspectives. Although this study has produced several
replications and extensions of the prior taxonomic research in delinquency, it has also revealed some of the complexities of both the vertical and horizontal structures of delinquency.
3. Conclusion
In this paper we report on several difficult issues in finding clusters in a large sample
of delinquent youth using the Youth COMPAS assessment instrument. This instrument
contains 32 social and psychological scales that are widely used in assessing criminal
and delinquent populations.
Cluster analysis methods (Ward’s method, standard k-means, bagged k-means and
a semi-supervised pattern learning technique) were applied to the data. Cross-method
verification and external validity were examined. Core or exemplar cases were identified
by means of a voting (consensus) procedure. Seven recurrent clusters emerged across
replications.
The clusters identified were Internalizing Youth A [2,19,20], Socialized Delinquents [21,22,23], Versatile Offenders [2], Normal Accidental Delinquents [1], Internalizing Youth B [2], Low-control Versatile Offenders [2,3] and Normative Delinquency [24].
Each of these clusters was found to relate fairly clearly to types previously identified in
various studies in the criminology literature, but had never been identified at the same
time in one data set using clustering methods. Additionally, the present analysis provides
a more complete set of empirical descriptions for these recurring types than offered in any
previous studies. This is the first study in which most of the well replicated patterns were
identified purely from the data using unsupervised learning and clustering methods. Most
prior studies provide only partial theoretical or clinical descriptions, omit operational
type-identification procedures and offer only a limited coverage of the critical features.
In this project we introduced a novel way of hybrid-case elimination in an unsupervised setting. Although we are still working on establishing a more theoretical foundation for this approach, it has given results that are readily recognized and interpreted by delinquency counselors in applied juvenile justice settings. Following the establishment of these clusters, a classifier was developed from the data to efficiently classify new cases. A further methodological lesson was that the initial solution, obtained using an elaborate outlier removal process with Ward’s linkage and regular K-Means, was easily replicated using Bagged K-Means without outlier removal or other “manual” operations. The present project suggests that Bagged K-Means appears to be very robust against noise and outliers.
References
[1] T. E. Moffitt. Adolescence-limited and life-course persistent antisocial behavior: A developmental tax-
onomy. Psychological Review, 100(4):674–701, 1993.
[2] D. Lykken. The Antisocial Personalities. Lawrence Erlbaum, Hillsdale, N.J., 1995.
[3] L. Mealey. The sociobiology of sociopathy: An integrated evolutionary model. Behavioral and Brain
Sciences, 18(3):523–599, 1995.
[4] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, Cambridge, Mass., 2004. MIT Press.
[5] D.P. Farrington. Integrated Developmental and Life-Course Theories of Offending. Transaction Publishers, London, 2005.
[6] A. R. Piquero and T.E. Moffitt. Integrated Developmental and Life-Course Theories of Offending, chapter Explaining the facts of crime: How developmental taxonomy replies to Farrington’s Invitation. Transaction Publishers, London, 2005.
[7] D.W. Osgood. Making sense of crime and the life course. Annals of AAPSS, 602:196–211, 2005.
[8] A.P. Dempster, N.M. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
[9] Evgenia Dimitriadou, Andreas Weingessel, and Kurt Hornik. Voting-merging: An ensemble method for
clustering. In Lecture Notes in Computer Science, volume 2130, page 217. Springer Verlag, Jan 2001.
[10] Alexander P. Topchy, Anil K. Jain, and William F. Punch. Combining multiple weak clusterings. In
Proceedings of the ICDM, pages 331–338, 2003.
[11] Cheng-Ru Lin and Ming-Syan Chen. Combining partitional and hierarchical algorithms for robust and
efficient data clustering with cohesion self-merging. In IEEE Transactions on Knowledge and Data
Engineering, volume 17, pages 145 – 159, 2005.
[12] T. Brennan, M. Breitenbach, and W. Dieterich. Towards an explanatory taxonomy of adolescent
delinquents: Identifying several social-psychological profiles. Journal of Quantitative Criminology,
24(2):179–203, 2008.
[13] G. W. Milligan. Clustering and Classification, chapter Clustering validation: Results and implications
for applied analyses., pages 345–379. World Scientific Press, River Edge, NJ, 1996.
[14] J. Han and M. Kamber. Data Mining - Concepts and Techniques. Morgan Kauffman, San Francisco,
2000.
[15] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[16] S. Dolnicar and F. Leisch. Getting more out of binary data: Segmenting markets by bagged clustering.
Working Paper 71, SFB “Adaptive Information Systems and Modeling in Economics and Management Science”, August 2000.
[17] R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2004. ISBN 3-900051-07-0.
[18] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is “nearest neighbor” meaningful? Lecture Notes in Computer Science, 1540:217–235, 1999.
[19] M. Miller, D. Kaloupek, A. Dillon, and T. Keane. Externalizing and internalizing subtypes of combat-
related PTSD: A replication and extension using the PSY-5 scales. Journal of Abnormal Psychology,
113(4):636–645, 2004.
[20] A. Raine, T. E. Moffitt, and A. Caspi. Neurocognitive impairments in boys on the life-course persistent
antisocial path. Journal of Abnormal Psychology, 114(1):38–49, 2005.
[21] W. Miller. Lower-class culture as a generating milieu of gang delinquency. Journal of Social Issues,
14:5–19, 1958.
[22] C. F. Jesness. The Jesness Inventory Classification System. Criminal Justice and Behavior, 15(1):78–91,
1988.
[23] M. Q. Warren. Classification of offenders as an aid to efficient management and effective treatment.
Journal of Criminal Law, Criminology, and Police Science, 62:239–258, 1971.
[24] T. E. Moffitt, A. Caspi, M. Rutter, and P. A. Silva. Sex Differences in Antisocial Behaviour. Cambridge University Press, Cambridge, 2001.
[25] R. McIntyre and R. Blashfield. A nearest-centroid technique for evaluating the minimum-variance clustering procedure. Multivariate Behavioral Research, 15(2):225–238, 1980.
[26] A. Gordon. Classification. Chapman and Hall, New York, 1999.
[27] V. Vapnik. The Nature of Statistical Learning Theory. Wiley, NY, 1998.
[28] D. Nagin and Raymond Paternoster. Population heterogeneity and state dependence: State of the evi-
dence and directions for future research. Journal Of Quantitative Criminology, 16(2):117–144, 2000.
[29] P. Meehl and L.J. Yonce. Taxometric analysis. I: Detecting taxonicity with two quantitative indicators using means above and below a sliding cut (MAMBAC procedure). Psychological Reports, 74, 1994.
[30] T. Brennan. Classification: An overview of selected methodological issues. In D. M. Gottfredson and
M. Tonry, editors, Prediction and Classification: Criminal Justice Decision Making, pages 201–248.
University of Chicago Press, Chicago, 1987.
[31] G. W. Milligan and M. C. Cooper. An examination of procedures for determining the number of clusters
in a data set. Psychometrika, 50:159–79, 1985.
[32] P. T Costa, J.H Herbst, R.R. McCrae, J. Samuels, and D. J. Ozer. The replicability and utility of three
personality types. European Journal of Personality, 16:73–87, 2002.
[33] D. Magnusson. The individual as an organizing principle in psychological inquiry: A holistic approach. In L.R. Bergman, R.B. Cairns, L.-G. Nilsson, and L. Nystedt, editors, Developmental Science and the Holistic Approach, pages 33–47. Lawrence Erlbaum, Mahwah, NJ, 2000.
[34] A.E. Raftery and Nema Dean. Variable selection for model-based clustering. Technical Report 452,
Department of Statistics, University of Washington, May 2004.
Data Mining for Business Applications 137
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-137
Forecasting Online Auctions Using Dynamic Models

Wolfgang Jank and Galit Shmueli
Smith School of Business, University of Maryland, College Park, MD 20742; E-mail: [email protected]

Introduction
dicted price. One of the difficulties with such an approach is that information in the online environment changes constantly: new auctions enter the market, old (i.e. closed)
auctions drop out, and even within the same auction the price changes continuously with
every new incoming bid. Thus, a well-functioning forecasting system must be adaptive
to accommodate a constantly changing environment.
We propose a dynamic forecasting model that can adapt to change. In general, price
forecasts can be done in two different ways, in a static or in a dynamic way. The static
approach relates information that is known before the start of the auction to information
that becomes available after the auction closes. This is the basic principle of several
existing models [1,2,3,4]. For instance, one could relate the opening bid, the auction
length and a seller’s reputation to the final price. Notice that opening bid, auction length,
and seller reputation are all known at the auction start. Training a model on a suitable set
of past auctions, one can obtain static forecasts of the final price in that fashion. However,
this approach does not take into account important information that arrives during the
auction. The current number of competing bidders or the current price level are factors
that are only revealed during the ongoing auction and that are important in determining
the future price. Moreover, the current change in price also has a huge impact on the
future price. If, for instance, the price had increased at an extremely fast rate over the
last several hours, causing bidders to drop out of the bidding process or to revise their
bidding strategies, then this could have an immense impact on the evolution of price in
the next few hours and, subsequently, on the final price. We refer to models that account
for newly arriving information and for the rate at which this information changes as
dynamic models.
Dynamic price forecasting in online auctions is challenging for a variety of reasons.
Traditional methods for forecasting time-series, such as exponential smoothing or moving averages, cannot be applied in the auction context, at least not directly, due to the
special data structure. Traditional forecasting methods assume that data arrive in evenly-
spaced time intervals such as every quarter or every month. In such a setting, one trains
the model on data up to the current time period t, and then uses this model to predict
at time t + 1. Implied in this process is the assumption that the distance between two
adjacent time periods is equal, which is the case for quarterly or monthly data. Now
consider the case of online auctions. Bids arrive in very unevenly-spaced time intervals,
determined by the bidders and their bidding strategies, and the number of bids within a
short period of time can sometimes be very sparse, while at other times extremely dense. In
this setting, the distance between t and t + 1 can sometimes be more than a day, while at
other times it may only be a few seconds. Traditional forecasting methods also assume
that the time-series continues, at least in theory, for an infinite amount of time and does
not stop at any point in the near future. This is clearly not the case in a 5- or 7-day online
auction. The implication of this is a discrepancy in the estimated forecasting uncertainty.
And lastly, online auctions, even for the same product, can experience price paths with
very heterogeneous price dynamics [5,6]. By price dynamics we mean the speed at which
price travels during the auction and the rate at which this speed changes. Traditional
models do not account for instantaneous change and its effect on the price forecast. This
calls for new methods that can measure and incorporate this important information.
In this work we propose a new approach for forecasting price in online auctions.
The approach allows for dynamic forecasts in that it incorporates information from the
ongoing auction. It accommodates the unevenly spacing of data, and also incorporates
change in the price dynamics. Our forecasting approach is housed within the principles
of functional data analysis [7]. In Section 1 we explain the principles of functional data
analysis and derive our functional forecasting model in Section 2. We apply our model
to a set of bidding data for a variety of book auctions in Section 3. We conclude with
further remarks in Section 4.
A functional data set consists of a collection of continuous functional objects such as the
price paths in an online auction. Despite their continuous nature, limitations in human
perception and measurement capabilities allow us to observe these curves only at discrete
time points. Thus, the first step in a typical functional data analysis is to recover (or
estimate), from the observed data, the underlying continuous functional object [7]. This
is usually done with the help of data smoothing.
A variety of different smoothing methods exist. One very flexible and computationally efficient choice is the penalized smoothing spline [8]. Let τ1 , . . . , τL be a set of knots.
Then, a polynomial spline of order p is given by
f(t) = \beta_0 + \beta_1 t + \cdots + \beta_p t^p + \sum_{l=1}^{L} \beta_{pl} (t - \tau_l)_+^p , \qquad (1)
where u_+ = u \, I_{[u \ge 0]} denotes the positive part of the function u. Define the roughness penalty

\mathrm{PEN}_m(t) = \int \{ D^m f(t) \}^2 \, dt , \qquad (2)
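The fitting criterion itself, Equation (3), is not reproduced in this copy; a standard penalized least-squares form, consistent with the description that follows and with [8], would be

```latex
% Plausible reconstruction of the penalized smoothing criterion (3),
% not necessarily the authors' exact notation:
\min_{f} \; \sum_{j} \bigl\{ y(t_j) - f(t_j) \bigr\}^{2}
  \; + \; \lambda \, \mathrm{PEN}_m(f)
```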
where y(t) denotes the observed data at time t and the smoothing parameter λ controls
the trade-off between data-fit and smoothness of the function f . Using m = 2 in (3)
leads to the commonly encountered cubic smoothing spline. Other possible smoothers
include the use of B-splines or radial basis functions [8].
The choice of the knots influences the resulting smoothing spline. Our goal is to
obtain smoothing splines that represent, as much as possible, the price formation process.
To that end, our selection of knots mirrors the distribution of bid arrivals [9]. We also
choose the smoothing parameter λ to balance data-fit and smoothness [10].
The process of going from observed data to functional data is now as follows. For a
set of n functional objects, let tij denote the time of the jth observation (1 ≤ j ≤ ni ) on
the ith object (1 ≤ i ≤ n), and let yij = y(tij ) denote the corresponding measurements.
Let fi (t) denote the penalized smoothing spline fitted to the observations yi1 , . . . , yini .
Then, functional data analysis is performed on the continuous curves fi (t) rather than on
the noisy observations yi1 , . . . , yini . For ease of notation we will suppress the subscript
i and write yt = f (t) for the functional object and D(m) yt = f (m) (t) for its mth
derivative.
Consider Figure 1 for illustration. The circles in the top panel of Figure 1 correspond
to a scatterplot of the bids (on log-scale) versus their timing. The continuous curve in
the top panel shows a smoothing spline of order m = 4 using a smoothing parameter
λ = 50.
One of our modeling goals is to capture the dynamics of an auction. While yt describes the magnitude of the current price, it does not reveal the dynamics of how fast
the price is changing or moving. Attributes that we typically associate with a moving
object are its velocity (or its speed) as well as its acceleration. Note that we can compute
the price velocity and price acceleration via the first and second derivatives, D(1) yt and
D(2) yt , respectively.
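As an illustration of this step, the sketch below fits a smoothing spline to hypothetical bid times and log-prices and differentiates it; scipy's UnivariateSpline is used here as a stand-in for the penalized spline of Equations (1)-(2), with its smoothing factor s playing the role of λ, and the order and smoothing values are assumptions.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# Hypothetical bid times (in days) and prices for one 7-day auction.
t = np.array([0.02, 0.3, 0.9, 1.4, 1.8, 4.6, 5.2, 6.1, 6.9, 6.99])
y = np.log([3.0, 5.5, 8.0, 9.0, 9.5, 12.0, 15.0, 22.0, 30.0, 33.0])

# Smoothing spline fitted to the log-price observations.
f = UnivariateSpline(t, y, k=4, s=0.05)

velocity = f.derivative(1)       # D^(1) y_t : price velocity
acceleration = f.derivative(2)   # D^(2) y_t : price acceleration

grid = np.linspace(0, 7, 71)
price_hat, v_hat, a_hat = f(grid), velocity(grid), acceleration(grid)
```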
Consider again Figure 1. The middle panel corresponds to the price velocity, D(1) yt .
Similarly, the bottom panel shows the price acceleration, D(2) yt . The price velocity has
several interesting features. It starts out at a relatively high mark which is due to the
starting price that the first bid has to overcome. After the initial high speed, the price
increase slows down over the next several days, reaching a value close to zero midway through the auction. A close-to-zero price velocity means that the price increase is
extremely slow. In fact, there are no bids between the beginning of day 2 and the end of
day 4 and the price velocity reflects that. This is in stark contrast to the price increase on
the last day where the price velocity picks up pace and the price jumps up!
The bottom panel in Figure 1 represents price acceleration. Acceleration is an important indicator of dynamics since a change in velocity is preceded by a change in acceleration. In other words, a positive acceleration today will result in an increase of velocity tomorrow. Conversely, a decrease in velocity must be preceded by a negative acceleration (or deceleration). The bottom panel in Figure 1 shows that the price acceleration is increasing over the entire auction duration. This implies that the auction is constantly experiencing forces that change its price velocity. The price acceleration is flat during the middle of the auction where no bids are placed. With every new bid, the auction experiences new forces. The magnitude of the force depends on the size of the price-increment. Smaller price-increments will result in a smaller force. On the other hand, a large number of small consecutive price-increments will result in a large force. For instance, the last 2 bids in Figure 1 arrive during the final moments of the auction. Since the increments are
relatively small, the price acceleration is only moderate. A more systematic investigation

Figure 1. Current price, price velocity (first derivative) and price acceleration (second derivative) for a selected auction. The first graph shows the actual bids together with the fitted curve.
As pointed out earlier, the goal is to develop a dynamic forecasting model. By dynamic
we mean a model that operates in the live-auction and forecasts price at a future time
point of the ongoing auction. This is in contrast to a static forecasting model which makes
prediction only about the final price, and which takes into consideration only information
available before the start of the auction. Consider Figure 2 for illustration. Assume that
we observe the price path from the start of the auction until time t (solid black line).
We now want to forecast the continuation of this price path (broken grey lines, labelled
"A", "B", and "C"). The difficulty in producing this forecast is the uncertainty about the
price dynamics in the future. If the dynamics level-off, then the price increase will slow
down and we might see a price path similar to A. If the dynamics remain steady, the price
path might look more like the one in B. Or, if the dynamics sharply increase, then a path
[Figure 2: the observed price path up to time t and three possible continuations, labelled A, B and C, corresponding to levelling-off, steady and sharply increasing future price dynamics.]
like the one in C could be the consequence. Either way, knowledge of the future price dynamics appears to be a key factor!
Our dynamic forecasting model consequently consists of two parts: First, we develop
a model for price dynamics. Then, using estimated dynamics, together with other relevant
covariates, we derive an econometric model of the final price and use it to forecast the
outcome of an auction.
We pointed out earlier that one of the main characteristics of online auctions is their rapid change in dynamics. Since change in the (p + 1)st derivative precedes change in the pth derivative (e.g. change in acceleration precedes change in velocity), we make use of derivative information for forecasting. In the following, we develop a model to estimate
Forecasting is also done in two steps. Let 1 ≤ t ≤ T denote the observed time period and let T + 1, T + 2, T + 3, . . . denote the time periods we wish to forecast. We first forecast the next residual and then, using this forecast, predict the derivative at the next time point T + 1.
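The two displayed forecasting equations do not survive in this reproduction. Assuming, consistent with the later description of models (4)-(5), that the m-th derivative is written as a smooth trend g(t) plus an AR(p) residual u_t, plausible reconstructions are

```latex
% Assumed forms, not the authors' exact equations:
\begin{align*}
  \hat{u}_{T+1} &= \sum_{j=1}^{p} \hat{\phi}_j \, u_{T+1-j} ,
    && \text{(next residual)}\\
  \widehat{D^{(m)} y}_{T+1} &= \hat{g}(T+1) + \hat{u}_{T+1} .
    && \text{(next derivative)}
\end{align*}
```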
After forecasting the price dynamics, we use these forecasts to predict the auction
price over the next time periods up to the auction end. Many factors can affect the
price in an auction such as information about the auction format, the product, the
bidders and the seller. Let x(t) denote the vector of all such factors. Let d(t) =
(D(1) yt , D(2) yt , . . . , D(p) yt ) denote the vector of price dynamics, i.e. the vector of the
first p derivatives of y at time t. The price at t can be affected by the price at t − 1 and
potentially also by its values at times t − 2, t − 3, etc. Let l(t) = (yt−1 , yt−2 , . . . , yt−q )
denote the vector of the first q lags of yt . We then write the general dynamic forecasting
model as follows:
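The displayed model equation is lost in this copy; a reconstruction consistent with the definitions just given (the exact form labeled (9) in the original may differ slightly) is

```latex
% Plausible reconstruction of model (9):
\[
  y_t \;=\; \beta^{\top} x(t) \;+\; \gamma^{\top} d(t)
    \;+\; \delta^{\top} l(t) \;+\; \varepsilon_t
\]
```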
where β, γ and δ denote the parameter vectors and ε_t ∼ N(0, σ²). We use the estimated model (9) to predict the price at T + l as
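Again, the displayed equation is missing; a reconstruction consistent with the text (a plug-in prediction from the estimated model, with forecasted dynamics and lags substituted where future values are unknown) is

```latex
% Plausible reconstruction of the prediction equation (10):
\[
  \hat{y}_{T+l} \;=\; \hat{\beta}^{\top} x(T+l)
    \;+\; \hat{\gamma}^{\top} \hat{d}(T+l)
    \;+\; \hat{\delta}^{\top} \hat{l}(T+l)
\]
```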
3. Empirical Results
3.1. Data
Our data set is diverse and contains 768 eBay book auctions from October 2004. All auctions were 7 days long and span a variety of categories (see Table 1). Prices range from
$0.10 to $999 and are, not unexpectedly, highly skewed. Prices also vary significantly
across the different book categories. This data set is challenging due to its diversity in
products and price. We use 70% of these auctions (or 538 auctions) for training purposes.
The remaining 30% (or 230 auctions) are kept in the validation sample.
Table 1. Categories of 768 book auctions. The second column gives the number of auctions per category. The
third and fourth column show average and standard deviation of price per category.
Our model building investigations suggest that among all price dynamics only the velocity D(1) yt is significant for forecasting price in our data. We thus estimate model D(m) yt in (4) only for m = 1. We do so in the following way. Using a quadratic polynomial (k = 2) in time t and influence-weighted [10] predictor variables for book-category (x̃1 (t)) and shipping costs (x̃2 (t)) results in an AR(1) process for the residuals ut (i.e. p = 1 in (5)). The rationale behind using book-category and shipping costs in model (4) is that we would expect the dynamics to depend heavily on these two variables. For instance, the category of antiquarian and collectible books typically contains items that are of a rare nature and that appeal to a market that is not very price sensitive and has a strong interest in obtaining the item. This is also reflected in the large average price
differences may well be a different price evolution and thus different price dynamics. A
similar argument applies to shipping costs. Shipping costs are determined by the seller
and act as a "hidden" price premium. Bidders are often deterred by excessively high
shipping costs and as a consequence auctions may experience differences in the price
dynamics. Table 2 summarizes the estimated coefficients averaged across all auctions
from the training set. We can see that both book-category and shipping costs result in
significantly different price dynamics.
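Written out, the estimated special case just described (quadratic trend in t, influence-weighted book-category and shipping-cost covariates, AR(1) residuals) would take a form along these lines; this is a reconstruction, since Equations (4)-(5) themselves appear earlier in the original chapter and are not reproduced here:

```latex
% Assumed form of the estimated velocity model:
\begin{align*}
  D^{(1)} y_t &= a_0 + a_1 t + a_2 t^2
      + b_1 \tilde{x}_1(t) + b_2 \tilde{x}_2(t) + u_t ,\\
  u_t &= \phi_1 u_{t-1} + \epsilon_t
      \qquad \text{(AR(1) residuals).}
\end{align*}
```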
Table 2. Estimates for the velocity model D(1) yt in (4). The second column reports the estimated parameter values and the third column reports the associated significance levels. Values are averaged across the training set.
After modeling the price dynamics we estimate the price forecasting model (9). Recall that (9) contains three model components, x(t), d(t) and l(t). Among all reasonable
price-lags only the first lag is influential, so we have l(t) = yt−1 . Also, as mentioned
earlier, among the different price dynamics we only find the velocity to be important,
so d(t) = D(1) yt . The first two rows of Table 3 display the corresponding estimated
coefficients.
Note that both l(t) and d(t) are predictor variables derived from price, either
from its lag or from its dynamics. We also use 8 non-price related predictor variables
x(t) = (x1 (t), x2 (t), x3 (t), x̃4 (t), x̃5 (t), x̃6 (t), x̃7 (t), x̃8 (t))T . Specifically, the 8 pre-
dictor variables correspond to the average rating of all bidders until time t (which we
refer to as the current average bidder rating at time t and denote as x1 (t)), the current
number of bids at time t (x2 (t)), and the current winner rating at time t (x3 (t)). These
first 3 predictor variables are time-varying. We also consider 5 time-constant predictors:
the opening bid (x̃4 (t)), the seller rating (x̃5 (t)), the seller’s positive ratings (x̃6 (t)), the
shipping costs (x̃7 (t)), and the book category (x̃8 (t)), where x̃i (t) again denotes the
influence-weighted variables.
Table 3 shows the estimated parameter values for the full forecasting model. It is
interesting to note that book-category and shipping costs have low statistical significance.
The reason for this is that their effects have likely already been captured satisfactorily in
the model for the price velocity. Also notice that the model is estimated on the log-scale
for better model fit. That is, the response yt and all numeric predictors (x̃1 (t), . . . , x̃7 (t))
are log-transformed. The implication of this lies in the interpretation of the coefficients.
For instance, the value 0.051 implies that for every 1% increase in opening bid, the price
increases by about 0.05%, on average.
Table 3. Estimates for the price forecasting model (9). The first column indicates the part of the model design
that the predictor is associated with. The third column reports the estimated parameter values and the fourth
column reports the associated significance levels. Values are again averaged across the training set.
We estimate the forecasting model on the training data and use the validation data to
investigate its forecasting accuracy. To that end we assume that for the 230 auctions in
Figure 3. Mean absolute percentage error (MAPE) of the forecasted price over the last auction-day. The solid
line corresponds to our dynamic forecasting model; the dashed line correspond to double exponential smooth-
ing. The x-axis denotes the day of the auction.
the validation data we only observe the price until day 6 and we want to forecast the
remainder of the auction. We forecast price over the last day in small increments of 0.1
days. That is, from day 6 we forecast day 6.1, or the price after the first 2.4 hours of
day 7. From day 6.1 we forecast day 6.2 and so on until the auction-end, at day 7. The
advantage of a sliding-window approach is the possibility of feedback-based forecast
improvements. That is, as the auction progresses over the last day, the true price level
can be compared with its forecasted level and deviations can be channelled back into the
model for real-time forecast adjustments.
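A toy version of this feedback loop is sketched below. The "model" here is just a naive velocity extrapolation on a simulated log-price path, purely to illustrate the 0.1-day sliding window and the feedback of each newly observed price; it is not the chapter's forecasting model, and the price path is invented.

```python
import numpy as np

times = np.round(np.arange(0, 7.01, 0.1), 1)
true_log_price = np.log(3 + 30 * (times / 7) ** 3)      # hypothetical price path

# Prices observed up to day 6; the last day is forecast in 0.1-day steps.
observed = dict(zip(times[times <= 6.0], true_log_price[times <= 6.0]))
forecasts = {}
for t in np.round(np.arange(6.1, 7.01, 0.1), 1):
    ts = sorted(observed)
    velocity = (observed[ts[-1]] - observed[ts[-2]]) / (ts[-1] - ts[-2])
    forecasts[t] = observed[ts[-1]] + 0.1 * velocity     # one-step (0.1 day) forecast
    observed[t] = np.interp(t, times, true_log_price)    # feedback: true price at t
```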
Figure 3 shows the forecast accuracy on the validation sample. We measure forecasting accuracy using the mean absolute percentage error (MAPE), that is,

\mathrm{MAPE}_t = \frac{1}{230} \sum_{i=1}^{230} \left| \frac{\text{Predicted Price}_{t,i} - \text{True Price}_{t,i}}{\text{True Price}_{t,i}} \right| , \qquad t = 6.1, 6.2, \ldots, 7,
where i denotes the ith auction in the validation data. The solid line in Figure 3 corre-
sponds to MAPE for our dynamic forecasting model. We benchmark the performance of
our method against double exponential smoothing. Double exponential smoothing is a
popular short-term forecasting method which assigns exponentially decreasing weights as the observations become less recent and also takes into account a possible (changing)
trend in the data. The dashed line in Figure 3 corresponds to MAPE for double exponen-
tial smoothing.
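For reference, the benchmark can be reproduced in spirit with statsmodels' Holt (double exponential smoothing) model applied to the observed portion of a price path; the series below is hypothetical and the smoothing parameters are left to be optimized from the data.

```python
import numpy as np
from statsmodels.tsa.holtwinters import Holt

# Hypothetical log-price path observed every 0.1 days up to day 6 (61 points),
# used to forecast the remaining 10 steps of the last day.
steps = np.arange(0, 6.01, 0.1)
y_obs = np.log(3 + 30 * (steps / 7) ** 3)

holt_fit = Holt(y_obs).fit(optimized=True)
last_day_forecast = holt_fit.forecast(10)
```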
We notice that for both approaches, MAPE increases as we predict further into the
future. However, while for our dynamic model MAPE increases to only about 5% at the
auction-end, exponential smoothing incurs an error of over 40%. This difference in per-
formance is relatively surprising, especially given that exponential smoothing is a well-
established (and powerful) tool in time series analysis. One of the reasons for this underperformance is the rapid change in price dynamics, especially at the auction-end. Exponential smoothing, despite the ability to accommodate changing trends in the data, cannot account for the price dynamics. This is in contrast to our dynamic forecasting model
which explicitly models price velocity. As pointed out earlier, a change in a function’s
velocity precedes a change in the function itself, so it seems only natural that modeling
the dynamics makes a difference for forecasting the final price.
4. Conclusions
In this paper we develop a dynamic price forecasting model that operates during the live
auction. Forecasting price in online auctions can benefit different auction parties. For instance, price forecasts can be used to dynamically rank auctions for the same
(or similar) item by their predicted price. On any given day, there are several hundred,
or even thousands of open auctions available, especially for very popular items such as
Apple iPods or Microsoft Xboxes. Dynamic price ranking can identify the auctions with the lowest expected price which, subsequently, can help bidders make deci-
sions about which auctions to participate in. Auction forecasting can also be beneficial
to the seller or the auction house. For instance, the auction house can use price forecasts
to offer insurance to the seller. This is related to the idea by [2], who suggest offering
sellers an insurance that guarantees a minimum selling price. In order to do so, it is important to correctly forecast the price, at least on average. While Ghani’s method is static in nature, our dynamic forecasting approach could potentially allow more flexible features like an “Insure-It-Now” option, which would allow sellers to purchase an insurance either at the beginning of the auction, or during the live auction (coupled with a
time-varying premium). Price forecasts can also be used by eBay-driven businesses that
provide brokerage services to buyers or sellers.
And a final comment: In order for dynamic forecasting to work in practice, it is important that the method is scalable and efficient. We want to point out that all components
of our model are based on linear operations - estimating the smoothing spline in Section
3 or fitting the AR model in Section 4 are both done in ways very similar to least squares.
In fact, the total runtime (estimation on training data plus validation on holdout data) for
our dataset (over 700 auctions) is less than a minute, using program code that is not (yet)
optimized for speed.
References
[1] Ghani, R. and Simmons, H. (2004). Predicting the end-price of online auctions. In the Proceedings
of the International Workshop on Data Mining and Adaptive Modelling Methods for Economics and
Management, Pisa, Italy, 2004.
[2] Ghani, R. (2005). Price prediction and insurance for online auctions. In the Proceedings of the 11th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, 2005.
[3] Lucking-Reiley, D., Bryan, D., Prasad, N., and Reeves, D. (2000). Pennies from ebay: the determinants
of price in online auctions. Technical report, University of Arizona.
[4] Bajari, P. and Hortacsu, A. (2003). The winner’s curse, reserve prices and endogenous entry: Empirical insights from eBay auctions. RAND Journal of Economics, 34(2):329–355.
[5] Jank, W. and Shmueli, G. (2008). Studying Heterogeneity of Price Evolution in eBay Auctions via
Functional Clustering. Forthcoming at Adomavicius and Gupta (Eds.) Handbook of Information Systems
Series: Business Computing, Elsevier.
[6] Shmueli, G. and Jank, W. (2008). Modeling the Dynamics of Online Auctions: A Modern Statistical
Approach. Forthcoming at Kauffman and Tallon (Eds.) Economics, Information Systems & Ecommerce
Research II: Advanced Empirical Methods, M.E. Sharpe, Armonk, NY.
[7] Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. Springer Series in Statistics.
Springer-Verlag New York, 2nd edition.
[8] Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric Regression. Cambridge University
Press, Cambridge.
[9] Shmueli, G., Russo, R. P., and Jank, W. (2007). The Barista: A model for bid arrivals in online auctions.
The Annals of Applied Statistics, 1 (2), 412–441.
[10] Wang, S., Jank, W., and Shmueli, G. (2008). Forecasting ebay’s online auction prices using functional
data analysis. Forthcoming in The Journal of Business and Economic Statistics.
Data Mining for Business Applications 149
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-149
Introduction
Many business decisions require a broad understanding of ways that various external
events might impact the business. For example, as managers formulate their company’s
1 Corresponding author: 50 West San Fernando Street, San Jose, California 95113; E-mail: [email protected]
pose to Microsoft. While “poking around” on the Google corporate website, Gates
glanced at the listing of open positions, and he saw that Google was recruiting for all
kinds of expertise that had nothing to do with search. In fact, Gates noted, Google’s
recruiting goals seemed to mirror Microsoft’s! It was time to make defense against
Google a top priority for Microsoft.
This story shows the value of the Web as a source of business insight, but it also
illustrates how random and unsystematic the process of developing that Web-derived
insight can be. In practice, it is very difficult for any individual person (or even
reasonable-sized group of people) to scan the sheer volume of information, detect what
might be relevant, and do the necessary work to draw appropriate inferences and
connections to transform the raw information into useful business insights. This
process of scanning, detecting, and interpreting is not feasible to do manually at scale.
There are technologies and services, available today, that try to address this need,
but none of them offer a complete solution. Automated clipping services, for example,
can help filter the information stream, and thus represent a step in the right direction.
These services, however, do not help decision makers see the potential implications
that new pieces of information have on their organization’s specific concerns. Many
insights can only be generated by putting together several pieces of raw data from
disparate sources and by applying the relevant business knowledge to interpret them.
For example, only a system that models Microsoft’s current niche and product mix
would be able to detect the relevance of Google’s recruiting priorities. Without such a
model, the system cannot analyze the indirect relationships between events it detects
and the business objectives of the company it seeks to inform, leading it to either
ignore important events or cast its net too broadly. Hence, decision makers need more
than a filtered news source; they need tools that can directly draw connections between
data collected from the Web and the issues that matter to their business.
In contrast, enterprise Business Intelligence (BI) systems can help decision makers
see the implications of new information, but BI systems focus primarily on exploiting
the information that is flowing through a company’s own data systems to help
executives understand what is happening within their business operations. Since not
everything worth monitoring happens within the enterprise, executives need a
capability that can extend the limited, inward-looking scope of existing enterprise BI
To enable applications that can systematically monitor the Web and turn it into a
source of business insight, we have been developing a technology platform upon which
a variety of corporate radar applications can be built (see Figure 1). This evolving
platform consists of three main components – semantic models, natural language technologies (which we call web sensors), and an inference engine – that interact with each
other to guide the detection of relevant signals from the Web, to produce from these
signals a stream of structured event descriptions, and to interpret the implications of
these events to generate actionable insights.
The semantic models in our platform guide the detection and interpretation of relevant events from the Web. There are three types of models in our platform – a model of the business dynamics, a set of detection models, and a set of sensor models.
The business-dynamics model provides semantic representations of the ecosystem
in which a decision maker’s organization operates. To understand how this model is
used to drive processing in corporate radar applications, consider the following
example. Imagine that you manage a manufacturing company that attempts to use the
Web to discover actionable insights by running a system that monitors news stories and
price data. If your system can only notice price changes for your competitors’ products,
then the system would be of limited value: by the time the threat is that immediate, it
may already be too late to react. Moreover, something so directly relevant to the business will
most likely be noticed by company personnel anyway (which eliminates the need for an
automated solution). Now suppose instead that your system can notice price changes
for raw materials, rather than competing products, and moreover the raw materials are
not used in any of the products made by your company. If these raw materials are used
by your competitors, then a price change for any of these materials might have an
important, though indirect, impact on your business. Such price shifts happen all the
time, and humans trying to track and interpret all of these shifts may quickly become
overwhelmed. Suppose we have a system with a model describing who your
competitors are, which of their products compete with yours, and what raw materials
are used in each product. A relatively simple model like this, combined with basic
inferences about cost/price relationships, can enable a corporate radar to see that
although you do not use the raw material in question, a drop in the price of that
material may mean that your competitor can lower the price of its products, thereby
putting price pressure on you (see Figure 2).
Figure 2. A simple model of a competitive ecosystem and the insight that can be generated based on events
Figure 3. Left: The semantic model for a deploy event. Right: The semantic model for a hypothetical
wireless company – Acme Wireless.
To enable reuse and hence reduce the effort needed to customize the business-
dynamics model across different applications and domains, our platform uses an upper
ontology of generic concepts that can be extended to build domain specific ones. This
upper ontology – the Component Library [2] – provides a library of about 500 generic
events and entities that can be composed (and extended) to build business-dynamics
models for a wide range of corporate radar applications.
Detection models provide support for natural language processing which is needed
to detect and convert unstructured text on the Web into a structured event description –
i.e. a representation of the event type and the semantic roles of the entities participating
in the event. We use WordNet [6] and case theory [1] as detection models in our
technology platform. WordNet provides the lexical realizations for events and entities
in a business-dynamics model – i.e. these events and entities are annotated with the
corresponding senses from WordNet to indicate how they surface in language. For
example, a deploy event is annotated with the WordNet senses of deploy#2, launch#5,
etc.
Case theory provides the syntactic realizations for the semantic roles that an entity
can play in an event. For example, an entity performing an event plays the semantic
role of an agent and this role – according to case theory – can surface as a prepositional
phrase marked by the preposition “by” (e.g. )
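To make the role of a detection model concrete, the following Python sketch shows one way such an entry could be represented. It is an illustration only: the event label, sense identifiers and role patterns are taken from the examples above, not from the platform's actual data structures.

```python
# Illustrative sketch only (not the platform's implementation): a detection-model
# entry pairs an event type from the business-dynamics model with (a) the WordNet
# senses that lexically realize it and (b) case-theory realizations of its roles.
DETECTION_MODEL = {
    "Deploy": {
        "lexical_senses": ["deploy#2", "launch#5"],        # "lemma#sense" form
        "role_realizations": {
            "agent": ["subject", "pp:by"],    # e.g. "... deployed by <Company>"
            "object": ["direct-object"],      # e.g. "<Company> deployed <Product>"
        },
    },
}

def lexical_triggers(event_type):
    """Surface lemmas that may signal the given event type in raw text."""
    return {s.split("#")[0] for s in DETECTION_MODEL[event_type]["lexical_senses"]}

print(lexical_triggers("Deploy"))   # {'deploy', 'launch'}
```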
Figure 4. The semantic model for a sensor that detects business acquisition events.
The inference engine plays three core roles in our technology platform. 1) It uses the
business-dynamics model to determine which events to detect from the Web. This task
is accomplished by retrieving the events encoded in this model. 2) The inference
engine uses sensor models to determine which web sensors to invoke for the events to
be detected. A sensor is invoked if the event type encoded in its corresponding model
subsumes the event to be detected. 3) The inference engine generates actionable
insights from detected events by applying implications encoded in the corresponding
event representations from the business-dynamics model. The inference engine, for
example, can apply the implication encoded in the representation of the deploy event
(see Figure 3 right from Section 3.1) to a detected event about a competitor’s supplier
deploying a new product. The resulting inference will generate insight about the
competitor being able to improve its products that use the newly deployed one.
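The three roles can be summarized in a small sketch. The reasoning shown here is deliberately simplified – the platform itself relies on the KM inference engine described below – and the toy taxonomy, sensor names and implication rule are assumptions made purely for illustration.

```python
# Sketch (assumed data structures, not the KM implementation) of the three roles
# of the inference engine described above.

# 1) Events to detect come from the business-dynamics model.
BUSINESS_DYNAMICS_EVENTS = ["Deploy", "Acquire", "PriceChange"]

# Toy event taxonomy used for subsumption: parent -> children.
TAXONOMY = {"BusinessEvent": ["Deploy", "Acquire", "PriceChange"]}

def subsumes(general, specific):
    """True if `general` is the same as or an ancestor of `specific`."""
    return general == specific or specific in TAXONOMY.get(general, [])

# 2) Sensor models declare the event type each web sensor can detect.
SENSOR_MODELS = [
    {"sensor": "acquisition_sensor", "event_type": "Acquire"},
    {"sensor": "generic_event_sensor", "event_type": "BusinessEvent"},
]

def sensors_for(event_type):
    return [m["sensor"] for m in SENSOR_MODELS
            if subsumes(m["event_type"], event_type)]

# 3) Implications attached to event representations turn detections into insights.
def apply_implications(detected_event):
    if (detected_event["type"] == "Deploy"
            and detected_event["agent_role"] == "supplier-of-competitor"):
        return "Competitor may improve products that use the deployed component."
    return None

print(sensors_for("Acquire"))
print(apply_implications({"type": "Deploy",
                          "agent_role": "supplier-of-competitor"}))
```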
In addition to the above requirements, we also require the inference engine to
support (and reason over) expressive implications like the one in the above example –
i.e. implications that consider an entity’s role in a detected event and its relationship to
others within the overall ecosystem. This additional requirement improves the
relevancy of the insights generated. For example, if a supplier for your company
deploys a new product, then applying the same implication as before would generate
a completely different insight – i.e. your company, rather than your competitor, would be
able to improve its existing products.
We use the Knowledge Machine (KM) [5] to provide the requirements described
above, but other implementations such as Pellet [14] or Jena [12] can be used as well.
KM is a frame-based inference engine grounded in first order predicate logic. It
provides a query language to retrieve information of interest about the models – e.g.
events encoded in the business-dynamics model. KM also supports subsumption
reasoning – used to determine the appropriate sensors to invoke – and reasoning with
implication rules – used to interpret the implications of detected events.
KM provides additional capabilities such as reasoning about quantitative
constraints – e.g. at least one commercial sale; more than 10 deployments; etc. – and a
situation calculus to reason about how changes in the world relate to existing
information – e.g. reports of orders for chipsets may imply a handset deployment
within 6 months. These additional capabilities allow our technology platform to
support (and reason over) expressive implications to improve the relevance of the
insights generated.
It is worth mentioning that most corporate radar applications focus primarily on
weak signals, resulting in inferences that may not be sound. Hence, the resulting
inferences should be viewed as suggestions that the implied events could be happening
(as opposed to deductive conclusions that they are), so it is up to the user to decide the
likelihood of the implication actually occurring. To help the user make this judgment,
the inference engine provides provenance information about which weak signals led to
a particular implication. A weak signal combined with many others that point to the
same implication will give the user more cause to believe in the likelihood of that
implication over an implication resulting from a lone signal without any corroboration.
We are currently exploring more sophisticated (and automatic) ways of weighing the
likelihood of an implication, but at this point we leave this task to the user.
Web sensors detect relevant unstructured signals on the Web and produce from them
structured event descriptions that are consumed by the inference engine to generate
actionable insights. Decisions about what types of sensors to use and how they are
implemented depend on the specific corporate radar application. Some corporate
radars, like the Business Event Advisor described in Section 2.1, use a single sensor to
produce structured representations for all events that need to be detected. Other radars,
like the Technology Investment Radar described in Section 2.2, employ a collection of
specialized sensors – each targeting a specific event such as sales, deployments, etc.
Hence, our technology platform must support a variety of different web sensors.
Our platform satisfies this requirement through the sensor and detection models (see
Section 3.1). The sensor models provide an abstraction of the sensors which allows the
inference engine to determine which sensors to invoke without regard for their
implementation – e.g. the model for a business acquisition sensor (see Figure 4)
abstracts away the implementation of this sensor by encoding information such as the
type of the event detected and the confidence in the output that the inference engine can
reason over.
The detection models (see Section 3.1 also) provide linguistic support that specific
implementations may use to process unstructured text on the Web – e.g. many natural
language processing algorithms [11,13,16,18] produce structured representations from
text using lexical (and syntactic) knowledge which our platform provides.
In this section, we describe the two corporate radar applications that we have built on
the technology platform described in the previous section.
The Business Event Advisor [9] was a prototype built in 2006 as an early attempt to
develop the kind of corporate radar we discussed above. The objective is to help
executives identify external events that might constitute threats to – and opportunities
for – their business. For example, there would be important business value for
executives who could more consistently notice signs that a competitor might introduce
a new product to directly compete with one of their products, or that a supplier was at
risk of failing to deliver. The current implementation of the business event advisor
detects a small set of event types, but the approach it employs is broadly applicable.
Everything from a competitor’s online job-recruiting advertisements to announcements
of deals made is the kind of information that – if systematically tracked and interpreted
in terms of the business dynamics that govern the executive’s business – can be used to
provide these early warning signals.
The system is designed to address these needs by detecting, organizing, and
interpreting a broad range of external business events in order to help business
decision-makers spot external threats and opportunities affecting their business. The
system achieves this capability using a model of the business dynamics that encodes
the entities and events that impact the ecosystem in which a particular company
operates. This model, for example, can encode entities like manufacturers, the products
they make, their suppliers, their customers, etc., and can encode events like executive
hirings, mergers and acquisitions, price changes, etc. The specific entities and events
encoded depend on the company that the model (and hence the application) is
customized for.
The Business Event Advisor uses this model to continuously scan a wide range of
news sources on the Web to generate an executive dashboard like the one detailed in
Figure 5. This dashboard makes it possible to see systematically the landscape of
relevant events – categorized by event type, participants in the event, estimated
importance, and the portion of the ecosystem impacted.
To detect and produce structured representations for events of interest from
unstructured text, the system employs a single web sensor built using an open-source
classification engine to determine the most likely event type combined with a
commercial natural language processing product to recognize the relevant entities. This
web sensor also uses a library of syntactic patterns to determine the semantic roles
played by these entities. We refer the reader to [9] for a detailed discussion of how we
integrated these components.
Figure 5. A portion of the executive dashboard produced by the Business Event Advisor.
This application also allows the user to examine the details for any event such as
the raw signals from which the event was detected and the implications that are inferred
from it. Figure 6 details this feature for a product introduction event that the application
has detected. This event was detected from a story about Denso introducing new hybrid
vehicle components and suggests to corporate executives the possible threats and
opportunities that might impact their company.
The Business Event Advisor is a working system, but not a complete or robust one
that is ready for real users. Its main purpose was to demonstrate the value of the
corporate radar vision and our technology platform. The ambitious scope of the
application conveyed that vision but was too broad for a small research team to build
out at scale. We next set out to build a second corporate radar, with a more focused
scope, that could be built out in full and provide value for real users.
Decision makers often recognize – early on – the potential for a technology to have an
important impact on their business, but have difficulty determining when this potential
will be realized. For example, many executives in the mobile phone industry recognize
WiMax as a technology that may have a significant impact on their industry, but they
are less certain about whether (and especially when) that impact will be realized. Some
technologies that look promising in the lab never make it to market; others that go to
market become niche products which never deliver on the impact they promised
originally; and finally those that do deliver on their promise may do so on a different
time-line than one might have anticipated when the technology first began to emerge.
In order to manage their company effectively, executives need to continuously track
technologies to determine when various levels of investment are worthwhile – e.g.
when to invest in building up in-house expertise on the technology; when to start
designing and offering products based on the technology; etc.
The Technology Investment Radar is designed to address these needs by helping
decision-makers track the maturation of technologies that relate to their business and
understand when these technologies are mature enough to justify investing in them.
The Technology Investment Radar achieves this capability by using a model of the
business dynamics that encodes the events and entities that impact the ecosystem
surrounding a company and how they affect the maturation of technologies that are
relevant to the company. This information about technology maturation, which we call
the technology lifecycle, encodes the following stages that a technology can advance
through as it matures.
Associated with each stage is a set of gates (i.e. conditions) that must be met in
order for a technology to enter into that stage. These gates encode how entities (e.g.
manufacturers, suppliers, etc.) and events (e.g. sales, deployments, etc.) that impact the
target ecosystem determine which maturation stage a technology belongs in. For
example, a user’s company may require at least five sales (or deployments) of the
technology before that technology is considered to be in the emerging stage. This is an
example of a gate.
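A gate of this kind is naturally expressed as a predicate over the stream of detected events. The sketch below is illustrative only; the event-type names and the threshold of five come from the example above, not from an actual customer model.

```python
# Sketch (illustrative threshold only) of a maturation gate: a condition over
# counts of detected events that must hold before a technology enters a stage.
from collections import Counter

def emerging_gate(detected_events, min_sales_or_deployments=5):
    """Example gate: at least five sale or deployment events for the technology."""
    counts = Counter(e["type"] for e in detected_events)
    return counts["Sale"] + counts["Deployment"] >= min_sales_or_deployments

events = [{"type": "Sale"}] * 3 + [{"type": "Deployment"}] * 2
print(emerging_gate(events))   # True: the technology may enter the emerging stage
```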
Like the Business Event Advisor, the Technology Investment Radar uses its
business-dynamics model to continuously scan a variety of sources – e.g. RSS feeds,
blogs, public forums, standards sites, etc. – to produce a dashboard. Its dashboard is
detailed in Figure 7 which shows the maturation stage that each technology has
advanced to.
WiMax. Moreover, the user can further examine the details of any gate, such as the
events that have been detected in support of that gate.
Figure 8. A detailed view of the emerging stage in the Technology Investment Radar.
3. Evaluation
We customized the Technology Investment Radar for the Wireless CoP by modeling
the gates (and the events enabling them) that must be satisfied for each stage in the
business-dynamics model for this group (see Section 2.2 for a description of these
stages). We acquired these gates by interviewing five analysts from the Wireless CoP
for the metrics and criteria they use in assessing technology maturity. The job
description of these analysts is to monitor developments for various wireless
technologies in order to inform internal investment decisions, provide strategy and
technology consulting to external clients, and so forth.
Once the system was customized, we used it to track seven different wireless
technologies: WiMax, Mobile WiMax, WiFi, HSDPA, HSUPA, EVDO, and Mobile
TV.
We then enlisted 11 additional analysts from the Wireless CoP to evaluate the
Technology Investment Radar. Our system was the only one evaluated in this pilot
study because no competing system exists.
3.2. Accuracy
Table 1. The agreement between the Technology Investment Radar and human analysts on the maturity of
seven wireless technologies. The first column lists the technologies tracked. The second column lists the
agreement between our system and the human analysts given as percentages. The last column lists the
number of assessments made by the human analysts for each technology.
The overall agreement between the Technology Investment Radar and the human
analysts (and hence the accuracy of the maturity assessments produced by the system)
is 54.55%. Compared with chance agreement (which is 1/6), this difference is
statistically significant (p < 0.01, χ2 test), but several technologies – e.g. WiFi
and HSDPA – had low agreement, which led us to examine the cause. We found that a
recurring cause was the analysts’ disagreement with the business-dynamics model.
For example, several analysts disagreed with a gate called Vendor Consolidation in the
business-dynamics model because the events enabling this gate – i.e. merger and
acquisition events – were too restrictive. Other events like corporate alliances can also
enable this gate. Hence, we revised this model based on recurring disagreements from
the analysts.
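For readers who wish to reproduce the kind of significance test reported at the start of this section, the sketch below compares agreement counts against the 1/6 chance level using a chi-square goodness-of-fit test. The raw counts are assumptions (77 assessments, i.e. 11 analysts times 7 technologies), since only percentages are reported here.

```python
# Sketch of the significance test against chance agreement (1/6). The raw counts
# are assumed (77 = 11 analysts x 7 technologies); the text reports only percentages.
from scipy.stats import chisquare

n_assessments = 77
observed_agree = round(0.5455 * n_assessments)             # ~42 agreements
observed = [observed_agree, n_assessments - observed_agree]
expected = [n_assessments / 6, n_assessments * 5 / 6]      # chance agreement = 1/6

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.2g}")             # p well below 0.01
```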
We evaluated the effect of these revisions on the accuracy of the maturity assessments
given by the Technology Investment Radar. We enlisted 5 new analysts from the
Wireless CoP and had them provide maturity assessments using the same methodology
as above. Table 2 shows the result of this evaluation.
The overall agreement between the Technology Investment Radar and the new
human analysts (and hence the new accuracy of the system) is 74.28%. Compared with
the overall agreement from before (i.e. 54.55%), this difference is statistically
significant according to the χ2 test (p < 0.05).
Some technologies still had low agreement – e.g. HSDPA and HSUPA – but the
reason was disagreement among the human analysts themselves. In each case, the maturity
assessment given by the Technology Investment Radar was the same assessment given
by the majority of the analysts.
Table 2. The effect that the revised model of the business dynamics had on the agreement between the
maturity assessments given by the Technology Investment Radar and human analysts for the wireless
technologies tracked.
We also assessed, qualitatively, the utility of the Technology Investment Radar from an
end user perspective (e.g. will the analysts continue to use the system after the pilot,
how satisfied are the analysts with the tool, and so forth). This assessment was done
through an exit survey administered to the analysts from the pilot study.
The survey was completely anonymous. It was hosted on a third party survey
hosting site where the identities of the respondents were not known to us. Hence, the
respondents were not under any pressure to respond favorably. The survey consisted of
25 questions, but given space limitations, we will not present responses from all
these questions. Instead, we give an overview of the highlights from the survey based
on preliminary responses from 9 of the analysts.
• When asked to indicate their overall satisfaction with the system – the possible
answer choices are very satisfied, somewhat satisfied, neutral, somewhat
dissatisfied, and very dissatisfied – 66.7% of the analysts said very satisfied,
22.2% said somewhat satisfied, and 11.1% said neutral. No analysts gave a
somewhat dissatisfied or very dissatisfied response.
• When asked to indicate if they will continue to use the system after the pilot
study – the possible answer choices are yes and no – 88.9% of the analysts said
yes and 11.1% said no.
• When asked if they would recommend the system to a colleague – the possible
answer choices are yes and no – 88.9% of the analysts said yes and 11.1% said
no.
• When asked to indicate how using the Technology Investment Radar to track
technology maturation compared to their current method – the possible answer
choices are much better, somewhat better, about the same, somewhat worse,
and much worse – 22.2% of the analysts said much better, 66.7% said
somewhat better, 0.0% said about the same, and 11.1% said somewhat worse.
No analysts gave a much worse response.
These responses show that the majority of the analysts found the Technology
Investment Radar to be useful and will continue to use the system after the pilot. These
responses also demonstrate the value of corporate radars that can automatically mine
the Web for business insights that are relevant to a decision maker’s organization – in
this case insight regarding the maturity of technologies that impact the Wireless CoP.
maker’s business. The Technology Investment Radar was piloted with business users,
and we presented initial results from this pilot, which begin to demonstrate the value
that these kinds of systems can actually provide to real users.
Our initial experience with these prototypes – and the pilot results – has been
encouraging. However, there remain several issues that must be addressed in order to
turn our work into a solution that business analysts can readily use to build robust
radars customized for their concerns. This goal will require support that allows
business analysts, who are not trained knowledge engineers, to customize and create
the semantic models. We have begun developing a set of GUI-based business
environment modeling tools that will enable business analysts to perform this task. We
are also leveraging previous research on enabling subject matter experts to author
semantic models without the aid of knowledge engineers as part of this effort [3].
Achieving a more robust solution will also require either automated or semi-automated
methods that can enhance and update the semantic models used by corporate radars.
Although using a generic upper ontology as a starting point improves reuse and reduces
the customization effort required, a sizeable amount of time is still required to extend
this upper ontology for each new radar application and to update these models once an
organization’s concerns change. We are exploring approaches that can enhance and/or
update existing models based on emerging patterns of detected events – e.g. repeating
patterns of product introduction events coming from a competitor’s supplier, followed
by product feature change events coming from that competitor, might be recognized by
a rule learning system to create a rule whereby product introductions imply feature
changes in appropriately related entities.
Acknowledgement
This chapter draws on previous papers we have written on this topic, including
“Business Event Advisor: Mining the Net for Business Insight with Semantic Models,
Lightweight NLP, and Conceptual Inferences” and “Using Lightweight NLP and
Semantic Modeling to Realize the Internet’s Potential as a Corporate Radar”. We would like
to thank Chris Cowell-Shah, who contributed to the authoring of these papers and to the
evolution of our thinking on corporate radars. We also want to thank Chris for the
implementation of the Business Event Advisor.
References
[1] K. Barker, Semi-Automatic Recognition of Semantic Relationships in English Technical Texts, PhD
thesis, University of Ottawa, 1998.
[2] K. Barker, B. Porter, and P. Clark, A Library of Generic Concepts for Composing Knowledge Bases,
KCAP, 2001.
[3] K. Barker et al., A Knowledge Acquisition Tool for Course of Action Analysis, IAAI, 2003.
[4] T. Berners-Lee, J. Hendler, and O. Lassila, The Semantic Web: A New Form of Web Content that is
Meaningful to Computers Will Unleash a Revolution of Possibilities, Scientific American, 2001.
[5] P. Clark and B. Porter, KM: The Knowledge Machine, Technical Report,
http://www.cs.utexas.edu/users/mfkb/RKF/km.html.
[6] C. Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, 1998.
[7] D. Gildea and D. Jurafsky, Automatic Labeling of Semantic Roles, Computational Linguistics 28(3),
2002.
[8] K. Hacioglu, Semantic Role Labeling Using Dependency Trees, COLING, 2004.
[9] A. Kass and C. Cowell-Shah, Business Event Advisor: Mining the Net for Business Insight with
Semantic Models, Lightweight NLP, and Conceptual Inference, KDD Workshop on Data Mining for
Business Applications, 2006.
[10] O. Lassila and R. Swick, Resource Description Framework (RDF) Model and Syntax Specification,
Technical Report, W3C, 1999.
[11] M. Lesk, Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine
Cone from an Ice Cream Cone, 5th International Conference on Systems Documentation, 1986.
[12] B. McBride, Jena: Implementing the RDF Model and Syntax Specification, Semantic Web Workshop,
2001.
[13] R. Mihalcea and D. Moldovan, An Iterative Approach to Word Sense Disambiguation, FLAIRS, 2000.
[14] E. Sirin, B. Parsia, B. Grau, A. Kalyanpur and Y. Katz, Pellet: A Practical OWL-DL Reasoner,
Technical Report UMIACS, 2005.
[15] J. Sowa, Conceptual Structures: Information Processing in Mind and Machine, Addison-Wesley, 1984.
[16] R. Swier and S. Stevenson, Exploiting a Verb Lexicon in Automatic Semantic Role Labeling,
HLT/EMNLP, 2005.
[17] F. Vogelstein, Gates vs. Google: Search and Destroy, Fortune 151(9), 2005.
[18] P. Yeh, B. Porter, and K. Barker, A Unified Knowledge Based Approach for Sense Disambiguation and
Semantic Role Labeling, AAAI, 2006.
Spatial Data Mining in Practice: Principles and Case Studies
C. Körner et al.
doi:10.3233/978-1-60750-633-1-164
Introduction
In recent years the interest in spatial data has clearly been driven by the
wide availability of recording technologies such as the Global Positioning Sys-
tem (GPS), mobile phone data or radio frequency identification (RFID). Today,
nearly all database systems support data types for the storage and processing
of geographic data. However, knowledge discovery from geographic data is still
a young research direction. In classic data mining many algorithms operate in a
multi-dimensional feature space and are thus inherently spatial. Yet, they are not
necessarily adequate to model geographic space.
Spatial data mining combines statistics, machine learning, databases and vi-
sualization with geographic data. The task is to identify spatial patterns or ob-
jects that are potential generators of such patterns. This includes also the iden-
This section presents three case studies in marketing and planning which utilize
vector data for their analysis. The first study forecasts sales at potential new
locations for a trading company and emphasizes the handling of large numbers of
spatial features. The second and third studies apply visual analytics and subgroup
discovery for customer segmentation and the optimization of mobile networks,
respectively.
Choosing the appropriate site is crucial for the success of every retailing company.
From a microeconomic point of view, the expected sales at a location are the most
important decision criterion for the evaluation of potential new sites. However,
sales forecasting is still a great challenge in retail location planning today. How
can sales at potential new locations be predicted? And which factors influence
sales the most?
Our project partner is one of Austria’s leading trading companies. In order
to reduce the risk in location decisions while continuing growth, the company
sought an automated sales forecasting solution to evaluate possible new sites.
In our project we identified and quantified the most important factors influenc-
ing sales at operating store locations for three different product lines and store
formats (supermarket, hypermarket and drugstore). The main challenge of the
project was to handle an abundance of attributes which possessed diverse levels
of spatial resolution and for which the most appropriate resolution was not known
beforehand. We applied support vector machines (SVM) for the regression task
as they are robust in the face of high-dimensional data. SVMs are not spatial by
themselves; we therefore conducted extensive feature extraction during which all
spatial operations were performed.
The training set for model learning was made up of about 1,400 existing
stores from all over Austria and a broad variety of socio-economic, demographic
and market data on different administration levels as well as competitor informa-
tion and points of interest (POI). Most of the socio-economic and market data
were available on hierarchical spatial aggregation levels – states, districts (or
cities), municipalities and Zählsprengel – as well as post code areas. Zählsprengel
are subunits of municipalities at the lowest spatial aggregation level for which
official statistics are available (around 1,000 inhabitants on average). They proved
to be especially valuable for modeling purposes because they reflected most of the
spatial variability.
In order to characterize the environment of individual shops, we first built
trading areas for which socio-economic, demographic, competitor and POI infor-
mation was aggregated. The feature extraction process for each source of infor-
mation is described in more detail in the following paragraphs. Generally, ag-
gregation can be performed using buffers or drive time zones. They mark, for a
fixed location, the area which lies within a given range or which can be reached
within a given time respectively. However, location factors show different effects
on different levels of spatial aggregation, and it was not known beforehand
which levels would yield the highest impact. For instance, if attributes had been
taken into account solely based on 5-minute drive time zones, important posi-
tive shopping linkages, which mostly appear within the range of a 3-minute walk,
would have been lost. Therefore, we built several trading areas with varying spa-
tial extent based on drive time zones for cars and pedestrians (1-5 and 1-3 minutes
respectively) as well as buffers with a distance between 100 and 500 meters based
on the street network. This resulted in a total of 13 trading areas per store.
Naturally, the trading areas did not correspond to the spatial units by which
the socio-economic and demographic data were provided. Therefore, an assign-
ment of attribute values in proportion to the intersecting area of a trading cell
and other spatial units was made. Let ta denote the trading area of interest and
u ∈ U the spatial units that carry some attribute a(). The assignment is specified
by:
$$ a(ta) = \sum_{u \in U} \frac{\mathrm{area}(ta \cap u)}{\mathrm{area}(u)} \, a(u) $$
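A minimal sketch of this areal weighting is given below, using shapely for the geometric operations (an assumption; any GIS library offering intersection and area computations would serve). The polygons and attribute values are toy data rather than project data.

```python
# Sketch of the areal weighting above, using shapely (assumed here; any library
# with intersection/area operations would do).
from shapely.geometry import Polygon

def assign_attribute(trading_area, spatial_units):
    """a(ta) = sum over units u of area(ta ∩ u) / area(u) * a(u)."""
    total = 0.0
    for unit_geom, unit_value in spatial_units:
        overlap = trading_area.intersection(unit_geom).area
        if overlap > 0:
            total += overlap / unit_geom.area * unit_value
    return total

# Toy example: a trading area overlapping two census units (e.g. Zählsprengel).
ta = Polygon([(0, 0), (2, 0), (2, 1), (0, 1)])
units = [
    (Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]), 1000),   # fully inside -> 1000
    (Polygon([(1, 0), (3, 0), (3, 1), (1, 1)]), 800),    # half inside  -> 400
]
print(assign_attribute(ta, units))   # 1400.0
```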
to a competing retail chain, shop type and size, estimated turnover as well as
opening hours. It is crucial to incorporate competitive effects in the forecasting
model because competition is always a strong determinant of the amount of own
sales. However, it is important to note that competition need not be negative; it
can also have positive impacts by raising the cumulative attraction of a site. We
further took the distances to the company’s own shops into account as a competitive
factor, because they also draw off sales from a location (a phenomenon known as
retail cannibalization). Last, we included the ge-
ographical coordinates of the locations into our model to account for local and
global trends.
We created location-specific geographic features such as, for example, population
density, centrality, accessibility or a site’s shopping linkage potential. To assess
the shopping linkage potential, we evaluated 3,000 different branches of POI with
regard to their individual interception potential for the product lines and retail
formats of our project partner. We again ascertained the number of relevant POI
because it was expected that a high number of affine POI would increase the
attraction of a site and thus lead to higher sales. The process of geographical
One of the leading German gas suppliers in the B-to-B market provides database
marketing services for power authorities and local energy providers. A main chal-
lenge in this domain is to provide reliable knowledge about gas customers: What
are the main factors which influence customer interest in natural gas? How can
potential customers be reliably classified according to these characteristics? And
how can this knowledge be used to automatically support the selection of ad-
dresses for direct marketing purposes?
Spatial data mining and knowledge discovery are considered to be a promis-
ing way to deal with the above challenges, as the application involves the devel-
opment of models with geographically constrained validity, models using indirect
and contingent relations on geographical objects as well as efficient methods for
discovering this knowledge. The goals within our project were to a) find reliable
candidate features for customer description and b) classify addresses according
to the probability of customer interest in a sales representative visit.
The empirical basis of the study was a combined database of nationwide
address data with description of buildings, a database of discrete geographical
objects as rivers and elevation fields from a topographical map and a georeferenced
sample of response data from about 500,000 nationwide interviews (see left plot
in Figure 1).
In the data preparatory step the regional sample of response data was enriched
with building data and geographic context. The relation between the regional
sample and the building data was established by georeferencing the given
addresses. The enriched sample served as a training set for the analysis of interesting
and statistically extraordinary subgroups and for the construction of a model for
rule-based classification of addresses with high response probability.
Figure 1. left: data basis with georeferenced addresses and geographical objects; right: visual
exploratory analysis of customer response ratio
In a first step we explored the data using techniques from visual analytics.
Subsequently, the resulting hypotheses were tested for statistical significance using
binomial tests and subgroup mining. Visual analytics is the “science of analytical
reasoning facilitated by interactive visual interfaces” [8]. Especially in geographic
context the visualization of information plays an important role to profit from
background knowledge, flexible thinking and imagination of human analysts [9].
Subgroup discovery detects groups of objects with common characteristics
that show a significant deviation in their target value with respect to the whole
data set. In our application we searched for subgroups with a significantly larger
response probability to marketing campaigns than in general. The quality of a
subgroup depends on a quantitative and a qualitative term, which measure the
size of a subgroup and the pureness of the target attribute within the subgroup
respectively. More precisely, the quality q of a subgroup h is defined as
$$ q(h) = \frac{|p - p_0|}{\sqrt{p_0 (1 - p_0)}} \, \sqrt{n} $$
and accounts for the difference of target share between the subgroup p and the
whole data set p0, as well as the size n of the subgroup [10]. Spatial subgroups
are formed if the subgroup definition involves operations on spatial components
of the objects. However, spatial operations are expensive. They lead to a loss
of performance during execution or require additional storage when computed
in advance. Klösgen and May [11] developed a spatial subgroup mining system,
which integrates spatial feature extraction into the mining process. They exploit
the fact that it may not be necessary to compute all spatial relations due to early
pruning in the mining process. The spatial joins are performed separately on each
search level, which reduces the number of spatial operations and avoids redundant
storage of features.
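The quality function is straightforward to compute; the sketch below follows the formula directly. The response rates and subgroup size used in the example are illustrative numbers, not results from the study.

```python
# Sketch of the subgroup quality function defined above (a binomial-test style
# z-score); variable names follow the formula, the data are illustrative.
import math

def subgroup_quality(p, p0, n):
    """q(h) = |p - p0| / sqrt(p0 * (1 - p0)) * sqrt(n)."""
    return abs(p - p0) / math.sqrt(p0 * (1 - p0)) * math.sqrt(n)

# Example: overall response rate 2%, a subgroup of 400 addresses responds at 5%.
print(subgroup_quality(p=0.05, p0=0.02, n=400))   # ~4.29
```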
One major result of our study was that geographic relations, such as river
distance and ground elevation, as well as the age of buildings can be used to
improve the response probability of a sample of addresses. One example of an
interesting subgroup of customers was people using heating oil instead of gas
and living within 1 km distance from a larger river, which could be explained by
the specific flooding risk for oil tanks. Figure 1 (right) shows an example of this
pattern as experienced during visual exploratory analysis.
The quality and coverage of a mobile network are key factors for the success of
a mobile telecommunication company. In order to support decisions about the
extension and optimization of such a network, we analyzed the capacity, quality
and cost-effectiveness of the mobile network of one of the leading German mobile
telecommunication companies. The goal of the project was to identify rural areas
with a high demand in mobile network services and to relate the demand to
demographic and geographic characteristics.
Mobile networks extend over geographic space and therefore strongly call for
the inclusion of geographic data in the analysis process. A first exploratory
data analysis showed, for example, a decreasing network quality within
cells in increasingly hilly areas. The overall input data of the project consisted
of network usage, demographic and geographic information. In the data prepara-
tory step we merged all three kinds of data and aggregated attribute values such
as population and POI for radio network cells. In addition, we defined a target
attribute which describes the demand of (future) network services:
$$ \text{cell potential} = \frac{\text{number of calls}}{\text{number of customers}} \cdot \text{population} $$
It weights the population of an area with the average number of calls of the
present customers. Similar to the above project about customer segmentation,
we applied subgroup discovery to detect variables that influence the demand for
mobile network services. We used the SPIN! [12] spatial data mining platform,
which has been developed within the EU project IST-1999-10536 SPIN!. It joins
the power of data mining and geographic information systems by the implemen-
Figure 2. left: subgroup patterns based on network usage in the area of Stuttgart; right: subgroup
patterns based on geographic information in the area of Stuttgart
In this section we present two case studies which involve network data and ge-
ographically referenced time series. The first study develops a traffic frequency
map and emphasizes the tight integration of feature extraction and data mining
algorithms for performance optimization. The second study extracts customer
movements from tabloid sales data.
ing vehicles, pedestrians and public transport while the qualitative term specifies
the average notice of passers-by. As part of an industrial project we developed
a frequency map for German cities which today forms an essential part of price
calculations in the German outdoor advertising industry.
Essential for the prediction of traffic frequencies are the exploitation of geo-
graphic neighborhood, inclusion of background knowledge and performance op-
timization. We therefore applied a modified k-nearest neighbor (kNN) algorithm
[13]. Nearest neighbor algorithms are generally able to incorporate spatial and
non-spatial information based on the definition of appropriate distance functions.
Thus, they are inherently spatial and exploit autocorrelation as a matter of prin-
ciple. In order to gain background knowledge about the vicinity of a street, several
geographically referenced attributes were aggregated. Furthermore, the large do-
main required a tight integration of spatial feature extraction and the algorithm
in order to reduce expensive spatial operations.
The input data comprised several sources of different quality and resolution.
The primary objects of interest were street segments, which generally denote
a part of street between two intersections. Each segment possessed a geometry
object and had attached information about the type of street, direction, speed
class etc. Germany contains a total of 6.2 million street segments, for which about
100,000 traffic measurements were available. In addition, demographic and socio-
economic data about the vicinity as well as nearby POI were known. Demographic
and socio-economic data usually exist in aggregated form, for example, for official
districts like post code areas. This information was likewise assigned to all street
segments in an area. In contrast, POI are point data that mark attractive places
like railway stations or restaurants. Clearly, areas with a high POI density are
more frequented than areas with a low density. In order to obtain density infor-
mation the POI data were aggregated. Two basic aggregation methods are buffers
and drive time zones. As explained earlier, they mark, for a fixed location, the
area which lies within a given range or which can be reached within a given time
respectively. Drive time zones emphasize network constraints related to topology
and allowed speed. Imagine, for example, two locations on opposite sides of a
river. Their spatial distance is small, but the travel time between them depends
on the location of the nearest bridge. For our application we created buffers around
each street segment and calculated the number of relevant POI.
The central part of our traffic frequency prediction is a modified kNN al-
gorithm, which models geographic space as a subcomponent of the general at-
tribute space. The distance between two segments xa and xb is defined as the
(normalized) sum of absolute distances of their attributes
$$ d(x_a, x_b) = \sum_{i=1}^{m} |x_{ai} - x_{bi}| $$
For fine tuning, the attributes were assigned domain-dependent weights, which
we will not discuss further here. The frequency y0 of a street segment is calculated
as the normalized weighted sum of frequencies from the k nearest neighbors, each
weight inversely proportional to the distance between the two segments
$$ y_0 = \frac{\sum_{i=1}^{k} w_i y_i}{\sum_{i=1}^{k} w_i} \quad \text{with} \quad w_i = \frac{1}{d(x_0, x_i)} $$
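Taken together, the two formulas yield a compact prediction routine. The sketch below is a simplified version: the domain-dependent attribute weights are omitted and a small epsilon guards against zero distances, both choices made here for illustration rather than details of the deployed system.

```python
# Sketch of the modified kNN prediction described above: Manhattan distance over
# the attribute vector, inverse-distance-weighted average of the k nearest
# measured frequencies. Attribute weights are omitted for brevity.
def manhattan(xa, xb):
    return sum(abs(a - b) for a, b in zip(xa, xb))

def predict_frequency(x0, measurements, k=3, eps=1e-9):
    """measurements: list of (attribute_vector, observed_frequency)."""
    neighbours = sorted(measurements, key=lambda m: manhattan(x0, m[0]))[:k]
    weights = [1.0 / (manhattan(x0, x) + eps) for x, _ in neighbours]
    return sum(w * y for w, (_, y) in zip(weights, neighbours)) / sum(weights)

measurements = [([0.1, 0.3], 1200), ([0.2, 0.1], 900), ([0.9, 0.8], 4000)]
print(predict_frequency([0.15, 0.2], measurements, k=2))   # 1050.0
```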
The kNN algorithm is known to use extensive resources as the distances be-
tween each street segment and all available measurements have to be calculated.
For a city like Frankfurt this amounts to 43 million calculations (about 21,500
segments and 2,000 measurements). While differences in numerical attributes can
be determined very fast, the geographic distance between line segments is compu-
tationally expensive. We therefore implemented the algorithm to perform a dy-
namic and selective calculation of distance from each street segment to the various
measurement locations. First, at any time distances to only the top k neighbors
are stored, replacing them dynamically during the iteration over measurement
sites. Second, a step-wise calculation of distance is applied. If the summarized
distance of all non-spatial attributes already exceeds the maximal total distance
of the current k neighbors, the candidate neighbor can be safely discarded and no
spatial calculation is necessary. Else, the distance between the minimum bound-
ing rectangles (MBRs) of the line segments is calculated. The MBR distance is a
lower bound for the actual distance between the line segments and less expensive
to calculate. Again, if the distance of the non-spatial attributes plus the distance
between the MBRs is greater than or equal to the threshold, the instance can be dis-
carded. Only if both tests are passed is the actual spatial distance determined.
For the city of Frankfurt, this integrated approach sped up calculations from
nearly one day to about two hours. In addition, the dynamic calculations reduced
the required disc space substantially.
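The step-wise scheme can be captured in the following sketch. The distance routines are passed in as parameters because their concrete implementations (attribute distance, MBR distance, exact segment-to-segment distance) depend on the GIS stack in use; the control flow mirrors the description above, while the function names are assumptions.

```python
# Sketch of the two-stage pruning: cheap lower bounds are tested before the
# expensive segment-to-segment distance is computed. `nonspatial_dist`,
# `mbr_distance` and `exact_distance` stand in for the real routines.
import heapq

def knn_with_pruning(segment, measurements, k,
                     nonspatial_dist, mbr_distance, exact_distance):
    """Return the k nearest measurements as (distance, measurement) pairs."""
    heap = []   # max-heap via negated distances; holds the current best k
    for i, m in enumerate(measurements):
        threshold = -heap[0][0] if len(heap) == k else float("inf")
        d = nonspatial_dist(segment, m)
        if d > threshold:
            continue                      # test 1: non-spatial part alone too far
        if d + mbr_distance(segment, m) >= threshold:
            continue                      # test 2: MBR lower bound already too far
        d += exact_distance(segment, m)   # only now pay for the exact geometry
        if d < threshold:
            if len(heap) == k:
                heapq.heapreplace(heap, (-d, i, m))   # index i breaks ties
            else:
                heapq.heappush(heap, (-d, i, m))
    return sorted(((-neg, m) for neg, _, m in heap), key=lambda t: t[0])
```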
In recent years, companies have spent great effort to systematically profile their
customers in order to gain insights into target group characteristics and shopping
behavior. Newspaper companies are especially interested in purchasing behavior
as they face the challenge of supplying each point of sale (POS) with the number
of copies that are expected to be sold the next day. Existing forecasting systems
employ complex time series models to predict the expected sales. However, they
are bound to the temporal dimension and lack the understanding of local market
situations and the customers’ movement behavior at a particular selling point.
Clearly, closures and sellouts influence shopping behavior and lead to one key
business question: Where do readers buy a specific newspaper if their preferred
shop is closed or has no more copies left? In an industry project with the publisher
of Europe’s leading tabloid newspaper, we developed a spatial model to detect
and visualize local customer behavior.
The data basis for our model was approximately 110,000 POS, irregularly
distributed over Germany. For each object a triannual history of sales figures
was available. All objects were equipped with location information and could be
mapped to a network of street segments. Information about the street network
constrains vehicular as well as pedestrian movement and therefore simplifies the ge-
ographic space of possible movement. In addition, socio-demographic data about
the vicinity of a POS as well as nearby POI were known. Both are needed to bet-
ter understand, explain and learn the movement behavior of local target groups.
For example, certain patterns or habits might correlate with certain demographic
attributes or POI.
Shopping behavior is influenced by intrinsic as well as extrinsic factors [14,15].
This includes the individual destination, spatial barriers, mood (activation) and
available selling points. In our model we assume that readers follow some routine.
For example, the reader may buy the newspaper at his/her preferred selling point
along his/her way to work. Such a routine can easily be interrupted by external
factors as, for example, sellouts, vacation or openings of new shops requiring the
customer to adapt his/her behavior. The challenge of the project was to detect,
quantify and learn the behavior of customers after any such event and to predict
the amount of copies that are additionally sold in alternative shops. Clearly,
without personalized data customer movement can hardly be traced over a whole
city. We therefore restricted our analysis to the local environment of a POS.
The first task in learning local movement patterns was to define a reasonable
spatial unit for movement detection, which we call movement space (see left plot
in Figure 3). If the unit is set too large, movement patterns will be lost in general
noise or overlaid by side effects due to events at other POS. Limiting the space
too strongly, however, reduces the chance to detect reasonable movement patterns
within. We employed two criteria to define the size of the unit, namely drive
time zones and Voronoi neighbors. Drive time zones were used to set the initial
(and maximal) extent of the movement space according to typical pedestrian
walking speed. This area was further restricted based on the assumption that
people who immediately seek an alternative POS will not pass by two alternative
POS without buying. Of course, the individual choice depends on the knowledge
of each customer about the set of selling points in his/her range (awareness set).
In order to limit the movement space, we calculated the convex hull of the second
order POS Voronoi neighbor (see right plot in Figure 3). The resulting area was
the space in which we looked for additional newspaper sales as an indicator for
movement if the service at some POS had been unavailable. We call the set of all
POS inside the movement space optional shops.
Figure 3. Left: movement space of a particular POS, showing the convex hull of the
second-order Voronoi neighbors (dark gray area) and the initial drive time zone of the
POS (bold gray street segments). Right: second-order Voronoi neighbors of a POS with
respect to natural barriers.
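The following sketch illustrates this construction in Python. It assumes that POS
locations are given as an (n, 2) array of planar coordinates and approximates the
drive time zone by a simple maximal walking distance; in the project, actual drive
time zones derived from the street network were used instead.

import numpy as np
from scipy.spatial import Voronoi, ConvexHull

def movement_space(pos_xy, focal, max_dist):
    """Candidate neighbors and convex hull forming the movement space of a POS."""
    vor = Voronoi(pos_xy)

    # Voronoi adjacency: two POS are neighbors if their cells share a ridge.
    adjacency = {i: set() for i in range(len(pos_xy))}
    for a, b in vor.ridge_points:
        adjacency[a].add(b)
        adjacency[b].add(a)

    # First- and second-order Voronoi neighbors of the focal POS.
    first = adjacency[focal]
    second = set(first)
    for n in first:
        second |= adjacency[n]
    second.discard(focal)

    # Crude stand-in for the drive time zone: a maximal walking distance.
    second = sorted(second)
    dists = np.linalg.norm(pos_xy[second] - pos_xy[focal], axis=1)
    kept = [i for i, d in zip(second, dists) if d <= max_dist]

    # The movement space is the convex hull of the remaining neighbors
    # together with the focal POS itself.
    hull = ConvexHull(pos_xy[kept + [focal]])
    return kept, hull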
The basic idea for detecting local movement patterns in case of a changed local
market situation (closures, sellouts, etc.) is to predict the sales of all optional
shops assuming typical shopping behavior and to compare the prediction with
their actual sales. All shops showing increased sales are likely to gain customers
from the considered shop. In order to predict the expected number of copies at
some POS, we calculated the sales based on shops with similar selling trends in
the recent past. These shops are called reference shops. The reference shops were
determined dynamically by maximizing the similarity of selling trends over a
two-week window before the registered event at the original POS. In this way,
seasonal or regional trends could also be anticipated. Of course, all reference shops
have to be located outside the movement space in order to be independent of any
event-driven movement caused by the POS under consideration. If an optional
shop sells a certain number of copies above the expected figure, it is likely that
customers of the considered POS buy their newspaper at that shop instead.
Over time we gain robust knowledge about the movement behavior of the local
customer base as well as a set of alternative shops inside the movement space.
With this knowledge, newspaper companies can optimize the number of copies
they deliver to each POS, taking into account not only time-variant information
but also the current local market situation. Moreover, the information about customer
behavior provided by movement spaces makes it possible to optimize location planning
and to calculate the effect of opening or closing a POS in a specific area.
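The following sketch illustrates the reference-shop comparison on daily sales data.
The correlation-based similarity measure, the fourteen-day post-event window and the
uplift threshold are illustrative assumptions rather than the parameters actually used
in the project; the description above only specifies that reference shops are selected
by similarity of selling trends in a two-week window before the event.

import pandas as pd

def likely_gainers(sales, event_date, optional_shops, outside_shops,
                   k=10, uplift=1.2):
    """sales: DataFrame of daily copies sold, indexed by date, one column per POS."""
    pre = sales.loc[event_date - pd.Timedelta(days=14):event_date - pd.Timedelta(days=1)]
    post = sales.loc[event_date:event_date + pd.Timedelta(days=13)]

    gainers = []
    for shop in optional_shops:
        # Reference shops: the k most similar selling trends outside the movement space.
        refs = pre[outside_shops].corrwith(pre[shop]).nlargest(k).index

        # Expected post-event sales: scale the shop's own pre-event level by the
        # post/pre ratio observed at its reference shops.
        ratio = post[refs].mean().mean() / pre[refs].mean().mean()
        expected = pre[shop].mean() * ratio

        # Shops selling clearly above expectation likely gain displaced customers.
        if post[shop].mean() > uplift * expected:
            gainers.append(shop)
    return gainers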
Over the past five years, GPS technology has steadily established itself in the mass
market and is on the threshold of becoming an everyday companion. Besides its
application in navigation systems, enterprises have also recognized the value of
movement histories. The outdoor advertising industries of Germany and Switzerland
commissioned nationwide GPS field studies to collect representative samples
of mobile behavior. The data are used to calculate reach and gross contacts of
poster campaigns for specific target populations.
This section describes a general approach to mobility mining in outdoor advertising
and highlights challenges from current industrial projects for Arbeitsge-
meinschaft Media-Analyse e.V. (ag.ma) in Germany and Swiss Poster Research
Plus (SPR+) in Switzerland.
The reach of a campaign states the percentage of the population that has at least
one contact with any poster of the campaign within a specified period of time.
Poster reach makes it possible to determine the optimal duration of an advertisement
and to tune the configuration of poster networks, as it expresses the publicity of a
location and the spread of information within the population.
Given trajectories for a sample of the population and geographic coordinates
of poster locations, the contacts with a given poster campaign can be extracted by
spatial intersection and the reach can be determined. One challenge of calculating
poster reach lies in the incompleteness of sample trajectories. For example, many
trajectories are incomplete due to technical defects or because people forget (to
switch on) their GPS devices. In addition, people tend to drop out of the study
early, which leads to a decreasing number of participants with advancing time.
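The extraction of contacts by spatial intersection can be sketched as follows,
assuming point-based GPS fixes and a fixed visibility radius around each poster;
real contact models are considerably more elaborate. The reach for a period then
follows as the fraction of test persons whose first contact falls within that period,
ignoring for the moment the incompleteness just described, which the Kaplan-Meier
treatment below addresses.

from shapely.geometry import Point

def first_contact_day(trajectory, posters, radius=25.0):
    """trajectory: iterable of (day, x, y) GPS fixes of one test person;
    posters: iterable of (x, y) poster positions of the campaign.
    Returns the first day with a contact to any poster, or None."""
    zones = [Point(px, py).buffer(radius) for px, py in posters]
    for day, x, y in sorted(trajectory):
        fix = Point(x, y)
        if any(zone.contains(fix) for zone in zones):
            return day
    return None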
What possibilities exist to handle incomplete data? In general, missing data can
be treated in the data preparation step or within the modeling process. During
data preparation, incomplete data objects can be removed, ignored or filled in by
modeling. However, none of these options is practicable in our
application. First, if incomplete data objects are removed, the size of the data
set decreases drastically because only a few test persons produce trajectories for
the whole surveying period. Second, ignoring missing data leads to an underes-
timation of poster contacts and thus to an underestimation of poster reach as
well. Finally, the reconstruction of missing trajectories is a fairly complex and
ambitious task. We therefore treat missing data explicitly in the modeling step,
applying a technique from the area of event history analysis.
Event history analysis (also survival analysis) [16] is a branch of statistics
that investigates the probability of some event or the amount of time until a
specific event occurs. It is usually applied in clinical studies and quality control
where an event denotes, for example, the occurrence of some disease or the failure
of a device. In our application an event denotes the first contact of a test person
with a poster campaign. To calculate poster reach, we apply the Kaplan-Meier
method, which allows for censored data. This method adapts to differing sample
sizes by calculating conditional probabilities between two consecutive events. If
no more data of a test person are available, the person is assumed to survive until
the next event occurs and is censored afterwards. Thus, a gradual adjustment to
the actual number of people in the sample is achieved.
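A minimal sketch of this estimator is given below. For simplicity it works on whole
days and censors each test person at the last observed day, rather than after the next
event as described above; both are simplifications of the actual procedure.

def kaplan_meier_reach(persons, horizon):
    """persons: list of (event_day, last_day) per test person, where event_day is
    the day of the first campaign contact (None if none was observed) and
    last_day is the last day with data. Returns reach for days 1..horizon."""
    survival, reach = 1.0, []
    for day in range(1, horizon + 1):
        # At risk: persons still under observation and without an earlier contact.
        at_risk = sum(1 for e, last in persons
                      if last >= day and (e is None or e >= day))
        events = sum(1 for e, _ in persons if e == day)
        if at_risk > 0:
            # Kaplan-Meier step: conditional survival between consecutive days.
            survival *= 1.0 - events / at_risk
        reach.append(1.0 - survival)
    return reach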
The ag.ma, a joint industry committee of German advertising vendors and cus-
tomers, commissioned a nationwide survey to collect mobility data using two
different surveying technologies. From a total of about 30,000 test persons, one
third was provided with GPS devices, while the other test persons were queried
about their movements in a Computer Assisted Telephone Interview (CATI). One
task of the project was to analyze both data sets with respect to differences in
content and structure, and to combine them for modeling if possible.
Both surveying techniques bear the risk of incomplete and erroneous data.
GPS devices may easily be forgotten or passed on to other family members, while
telephone interviews demand a precise and complete recollection of the previous
day's activities. We therefore compared the mobile behavior recorded in the two
data sets. The analysis showed similar movement behavior, for example in the
average number of trips per day and the average distance traveled.
The main structural difference between the data sets is the surveying period.
While all GPS test persons collected data over a period of one week, CATI test
persons were asked about their movements on the day before the
interview only. However, a combination of both data sets with regard to their
structure was possible due to the adaptive character of Kaplan-Meier. As Kaplan-
Meier censors missing days, the modeling process is robust against varying lengths
of surveying periods.
In our project with SPR+, we investigate further research questions that concern
the prediction of reach when only a limited number of measurements is available.
The first task is to predict poster reach when the measurement period is shorter
than the desired interval of time. The second challenge is to predict poster reach
in a city where no measurements at all are available. In this case, the reach of a
given campaign within one city has to be inferred from the mobility of another
(similar) city.
For the extrapolation of reach beyond the surveying period, we combine two
different extrapolation techniques. The first technique utilizes the reach of one
week to fit a log-linear function and subsequently extrapolates values for longer
periods. The second technique relies on the assumption of weekly periodic mo-
bility patterns and replicates mobile behavior accordingly. Both techniques are
interwoven according to the stability of the available data.
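The first technique can be sketched as follows, assuming the functional form
reach(t) = a + b log t fitted by ordinary least squares to the measured week; the
exact functional form is an assumption, and the weekly replication as well as the
blending of both techniques are omitted here.

import numpy as np

def extrapolate_reach(weekly_reach, horizon):
    """weekly_reach: observed reach for days 1..7; returns reach for days 1..horizon."""
    days = np.arange(1, len(weekly_reach) + 1)
    # Fit reach(t) = a + b * log(t) by ordinary least squares.
    X = np.column_stack([np.ones(len(days)), np.log(days)])
    a, b = np.linalg.lstsq(X, np.asarray(weekly_reach, float), rcond=None)[0]
    # Extrapolate to the desired horizon and keep the result in [0, 1].
    t = np.arange(1, horizon + 1)
    return np.clip(a + b * np.log(t), 0.0, 1.0)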
The extrapolation for areas without GPS measurements is a great challenge.
Neither GPS data nor other mobility information, such as traffic frequencies, is
available. In addition, individual poster characteristics that affect the intensity of a
contact need to be taken into account for the calculation of reach.
The extrapolation method therefore consists of three separate steps. First, the
traffic behavior at the poster locations of interest is inferred. Second, the pas-
sages are scaled according to individual poster characteristics. Finally, the reach
of a campaign with a similar contact distribution is assigned to the campaign of
interest. In the first step, various location attributes such as the type of street,
type and number of nearby POI or the size of population define a similarity mea-
sure by which poster passages are extrapolated. In the next step, a scaling factor
which transforms passages into poster contacts is applied. The factor depends on
individual poster characteristics and is determined based on evaluations in GPS
cities. The final assignment of poster reach depends again on a similarity mea-
sure which is defined on the contact distribution of the campaign of interest. The
extrapolation method thus accounts for general traffic characteristics, yet allows
for individual features of poster campaigns. In order to validate our extrapola-
tion method, we applied the technique in a city with GPS measurements. The
comparison of modeled and extrapolated values showed a high correlation.
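To make the first two steps concrete, the following sketch infers passages by
nearest-neighbor similarity over location attributes and scales them into contacts
with poster-specific factors. The attribute set, the nearest-neighbor similarity and
the scaling model are illustrative assumptions, not the models actually used in the
project.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def estimate_contacts(gps_features, gps_passages, target_features,
                      scaling_factors, k=5):
    """gps_features, gps_passages: location attributes and measured passages in
    GPS cities; target_features: attributes of the poster locations of interest;
    scaling_factors: per-poster factors turning passages into contacts."""
    knn = KNeighborsRegressor(n_neighbors=k, weights="distance")
    knn.fit(gps_features, gps_passages)
    passages = knn.predict(target_features)        # step 1: infer passages
    return passages * np.asarray(scaling_factors)  # step 2: scale to contacts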
4. Summary
Acknowledgment
The authors would like to thank all business partners for their close cooperation.
The publication of this chapter would have been impossible without their interest
and participation. Parts of this work have been inspired by research within the
EU projects IST-1999-10536 SPIN! (Spatial Mining for Data of Public Interest)
and IST-6FP-014915 GeoPKDD (Geographic Privacy-aware Knowledge Discov-
ery and Delivery). Finally, the authors acknowledge and thank all members of the
Department of Knowledge Discovery, whose research and constant work have
contributed to the success of the presented projects.
References
Subject Index
algorithms 164
business applications 149
business intelligence 149
case studies 164
churn 77
classification 99
competitive and market intelligence 149
customer churn 77
data cleaning 84
data mining
    ~ applications 1
    ~ process 1
    ~ stakeholders 49
    spatial ~ 164
    utility-based ~ 49
dynamics 137
eBay 137
electricity markets 99
forecasting 137
functional data analysis 137
hierarchical clustering 84, 99
human computer interaction 17
inference 149
interactivity 17
load profiles 99
medical knowledge discovery 110
NLP 149
online auctions 137
outlier detection 84
outlier ranking 84
performance support 149
PKB 110
retail banking 77
rigor vs. relevance in research 49
semantic models 149
subgroup discovery 17
Author Index
Ahola, J. 77
Blumenstock, A. 17
Breitenbach, M. 123
Brennan, T. 123
Bruckhaus, T. 66
Ceglar, A. 110
Dieterich, W. 123
Domingos, R. 35
Figueiredo, V. 99
Ghani, R. v, 1
Grudic, G. 123
Guthrie, W.E. 66
Hecker, D. 164
Hipp, J. 17
Jank, W. 137
Kass, A. 149
Kempe, S. 17
Körner, C. 164
Krause-Traudes, M. 164
Lanquillon, C. 17
May, M. 164
Morrall, R. 110
Mueller, M. 17
Mutanen, T. 77
Nousiainen, S. 77
Pechenizkiy, M. 49
Puuronen, S. 49
Roddick, J.F. 110
Rodrigues, F. 99
Scheider, S. 164
Schulz, D. 164
Shmueli, G. 137
Soares, C. v, 1, 84
Stange, H. 164
Torgo, L. 84
Vale, Z. 99
Van de Merckt, T. 35
Wirth, R. 17
Wrobel, S. 164
Yeh, P.Z. 149