DATA MINING FOR BUSINESS APPLICATIONS

Frontiers in Artificial Intelligence and
Applications
FAIA covers all aspects of theoretical and applied artificial intelligence research in the form of
monographs, doctoral dissertations, textbooks, handbooks and proceedings volumes. The FAIA
series contains several sub-series, including “Information Modelling and Knowledge Bases” and
“Knowledge-Based Intelligent Engineering Systems”. It also includes the biennial ECAI, the
European Conference on Artificial Intelligence, proceedings volumes, and other ECCAI – the
European Coordinating Committee on Artificial Intelligence – sponsored publications. An
editorial panel of internationally well-known scholars is appointed to provide a high quality
selection.

Series Editors:
J. Breuker, N. Guarino, J.N. Kok, J. Liu, R. López de Mántaras,
R. Mizoguchi, M. Musen, S.K. Pal and N. Zhong

Volume 218
Recently published in this series

Vol. 217. H. Fujita (Ed.), New Trends in Software Methodologies, Tools and Techniques –
Proceedings of the 9th SoMeT_10
Vol. 216. P. Baroni, F. Cerutti, M. Giacomin and G.R. Simari (Eds.), Computational Models of
Argument – Proceedings of COMMA 2010
Vol. 215. H. Coelho, R. Studer and M. Wooldridge (Eds.), ECAI 2010 – 19th European
Conference on Artificial Intelligence
Vol. 214. I.-O. Stathopoulou and G.A. Tsihrintzis, Visual Affect Recognition
Vol. 213. L. Obrst, T. Janssen and W. Ceusters (Eds.), Ontologies and Semantic Technologies
for Intelligence
Vol. 212. A. Respício et al. (Eds.), Bridging the Socio-Technical Gap in Decision Support
Systems – Challenges for the Next Decade
Vol. 211. J.I. da Silva Filho, G. Lambert-Torres and J.M. Abe, Uncertainty Treatment Using
Paraconsistent Logic – Introducing Paraconsistent Artificial Neural Networks
Vol. 210. O. Kutz et al. (Eds.), Modular Ontologies – Proceedings of the Fourth International
Workshop (WoMO 2010)
Vol. 209. A. Galton and R. Mizoguchi (Eds.), Formal Ontology in Information Systems –
Proceedings of the Sixth International Conference (FOIS 2010)
Vol. 208. G.L. Pozzato, Conditional and Preferential Logics: Proof Methods and Theorem
Proving
Vol. 207. A. Bifet, Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data
Streams
Vol. 206. T. Welzer Družovec et al. (Eds.), Information Modelling and Knowledge Bases XXI

ISSN 0922-6389 (print)


ISSN 1879-8314 (online)

Data Mining for Business
Applications

Edited by
Carlos Soares
LIAAD-INESC Porto L.A./Faculdade de Economia, Universidade do Porto,
Portugal
and
Rayid Ghani
Accenture Technology Labs, U.S.A.
Amsterdam • Berlin • Tokyo • Washington, DC

© 2010 The authors and IOS Press.

All rights reserved. No part of this book may be reproduced, stored in a retrieval system,
or transmitted, in any form or by any means, without prior written permission from the publisher.

ISBN 978-1-60750-632-4 (print)


ISBN 978-1-60750-633-1 (online)
Library of Congress Control Number: 2010934192

Publisher
IOS Press BV
Nieuwe Hemweg 6B
1013 BG Amsterdam
Netherlands
fax: +31 20 687 0019
e-mail: [email protected]

Distributor in the USA and Canada


IOS Press, Inc.
4502 Rachael Manor Drive
Fairfax, VA 22032
USA
fax: +1 703 323 3668
e-mail: [email protected]
LEGAL NOTICE
The publisher is not responsible for the use which might be made of the following information.

PRINTED IN THE NETHERLANDS

Data Mining for Business Applications v
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.

Preface
The field of data mining is currently experiencing a very dynamic period. It has reached
a level of maturity that has enabled it to be incorporated in IT systems and business
processes of companies across a wide range of industries. Information technology and
E-commerce companies such as Amazon, Google, Yahoo, Microsoft, IBM, HP and Ac-
centure, are naturally at the forefront of these developments. In addition, data mining
technologies are also getting well established in other industries and government sectors,
such as health, retail, automotive, finance, telecom and insurance, as part of large corpo-
rations such as Siemens, Daimler, Walmart, Washington Mutual, Progressive Insurance,
Portugal Telecom as well as in governments across the world.
As data mining becomes a mainstream technology in businesses, data mining re-
search has been experiencing explosive growth. In addition to well established applica-
tion areas such as targeted marketing, customer churn, and market basket analysis, we
are witnessing a wide range of new application areas, such as social media, social net-
works, and sensor networks. In addition, more traditional industries and business pro-
cesses, such as healthcare, manufacturing, customer relationship management and mar-
keting are also applying data mining technologies in new and interesting ways. These
areas pose new challenges both in terms of the nature of the data available (e.g., complex
and dynamic data structures) as well as in terms of the underlying supporting technology
(e.g., low-resource devices). These challenges can sometimes be tackled by adapting ex-
isting algorithms but at other times need new classes of techniques. This can be observed
by looking at the topics being covered at existing major data mining conferences and
journals as well as by the introduction of new ones.
A major reason behind the success of the data mining field has been the healthy
relationship between the research and the business worlds. This relationship is strong
in many companies where researchers and domain experts collaborate to solve practical
business problems. Many of the companies that integrate data mining into their products
and business processes also employ some of the best researchers and practitioners in the
field. Some of the most successful recent data mining companies have also been started
by distinguished researchers. Even researchers in universities are getting more connected
with businesses and are getting exposed to business problems and real data. Often, new
breakthroughs in data mining research have been motivated by the needs and constraints
of practical business problems. This can be observed at data mining scientific confer-
ences, where companies are participating very actively and there is a lot of interaction
between academia and industry.
As part of our (small) contribution to strengthen the collaboration between compa-
nies and universities in data mining, we have been helping organize a series of workshops
on Data Mining for Business Applications, with major conferences in the field:
• “Data Mining for Business” workshop, with ECML/PKDD, organized by Car-
los Soares, Luís Moniz (SAS Portugal) and Catarina Duarte (SAS Portugal),
which was held in Porto, Portugal, in 2005 (http://www.liaad.up.pt/dmbiz/).

• “Data Mining for Business Applications” workshop, with KDD, organized by
Rayid Ghani and Carlos Soares, in Philadelphia, USA, in 2006 (http://labs.accenture.com/kdd2006_workshop/).
• “Practical Data Mining: Applications Experiences and Challenges” workshop,
with ECML/PKDD, organized by Markus Ackermann (Univ. of Leipzig), Carlos
Soares and Bettina Guidemann (SAS Deutschland), which took place in Berlin,
Germany, in 2006 (http://wortschatz.uni-leipzig.de/~macker/dmbiz06/).
• “Data Mining for Business Applications” workshop, with KDD, organized by
Rayid Ghani, Carlos Soares, Françoise Soulié-Fogelman (KXEN), Katharina
Probst (Accenture Technology Labs) and Patrick Gallinari (Univ. of Paris), that
was held in Las Vegas, USA, in 2008 (http://labs.accenture.com/kdd2008_workshop/).
This book contains extended versions of a selection of papers from these workshops.
The chapters of this book cover the entire spectrum of issues in the development of data
mining systems with special attention to methodological issues. Although data mining
has reached a reasonable level of maturity and a large number and variety of algorithms,
tools and knowledge is available to develop good models and integrate them into business
processes, there is still space for research in new data mining methods. Many method-
ological issues still remain open, affecting several phases of data mining projects, from
business and data understanding to evaluation and deployment. As data mining gets ap-
plied to new business problems, new research challenges are encountered opening up
large unexplored areas of research. The chapters in Part 1 discuss some of the most
important of those issues. The authors offer diverse perspectives on those issues due to
the different nature of their backgrounds and experience, which include the automotive
industry, the data mining industry and the research community.
The book also covers a wide range of business domains, illustrating both classical
applications as well as emerging ones. The chapters in Part 2 describe typical problems
for which data mining has proved to be an invaluable tool, such as churn and fraud detec-
tion, and customer relationship management (CRM). They also cover some of the more
important industries, namely banking, government, energy and healthcare. The issues
addressed in these papers include important aspects such as how to incorporate domain-
specific knowledge in the development of data mining systems and the integration of
data mining technology in larger systems that aim to support core business processes.
The applications in this book clearly show that data mining projects must not be regarded
as independent efforts. They need to be integrated into larger systems to align with the
goals of the organization and those of its customers and partners. Additionally, the out-
put of data mining components must, in most cases, be integrated into the IT systems
of the business and, therefore, in its (decision-making) processes, sometimes as part of
decision-support systems (DSS).
The chapters in Part 3 are devoted to emerging applications of data mining. These
chapters discuss the application of novel methods that deal with complex data like social
networks and spatial data, to explore new opportunities in domains such as criminology
and marketing intelligence. These chapters illustrate some of the exciting developments
going on in the field and identify some of the most challenging opportunities. They stress
the need for researchers to keep up with emerging business problems, identify potential
applications and develop suitable solutions. They also show that companies must not only

pay attention to the latest developments in research but also continuously challenge the
research community with new problems. We believe that the flow of new and interesting
applications will continue for many years and drive the research community to come up
with exciting and useful data mining methods.
This book presents a collection of contributions that illustrates the importance of
maintaining close contact between data mining researchers and practitioners. For re-
searchers, it is essential to be exposed to and motivated by real problems and understand
how business problems not only provide interesting challenges but also practical con-
straints which must be taken into account in order for their work to have high practical
impact. For practitioners, it is not only important to be aware of the latest technology
developments in data mining, but also to have continuous interactions with the research
community to identify new opportunities to apply existing technologies and also provide
the motivation to develop new ones.
We believe that this book will be interesting not only for data mining researchers
and practitioners that are looking for new research and business opportunities in DM, but
also for students who wish to get a better understanding of the practical issues involved
in building data mining systems and find further research directions. We hope that our
readers will find this book useful.

Porto, Chicago – July 2010


Carlos Soares and Rayid Ghani

Members of the Program Committees


of the DMBiz Workshops

Alexander Tuzhilin New York University USA


Alípio Jorge University of Porto Portugal
André Carvalho University of São Paulo Brazil
Andrew Fano Accenture Technology Labs USA
Arlindo Oliveira IST/INESC-ID Portugal
Arno Knobbe Kiminkii The Netherlands
Leiden University
Bart Baesens University Leuven Belgium
Bin Gao Microsoft Research Asia China
Brigitte Trousse INRIA – Sophia Antipolis France
Carlos Fernando Nogueira Vivo Portugal
Carlos Soares University of Porto Portugal
Catarina Duarte SAS Portugal Portugal
Catherine Garbay University of Grenoble France
Chid Apte IBM Research USA
Christophe Giraud-Carrier Brigham Young University USA
Christian Derquenne EDF France
Clive Best JRC European Commission Belgium
Cristina Sequeira Crediflash Portugal
Damien Weldon LoanPerformance USA
Dave Watkins SPSS USA


Dennis Wilkinson HP Labs USA
Dietrich Wettschereck Recommind Germany
Dirk Arndt DaimlerChrysler Germany
Donato Malerba University of Bari Italy
Doug Bryan KXEN USA
Dragos Margineantu Boeing Company USA
Dunja Mladenic Jozef Stefan Institute Slovenia
Eric Auriol Climpact France
Fátima Rodrigues Polytechnical Institute of Porto Portugal
Fernanda Gomes BANIF Portugal
Floriana Esposito Universitá degli Studi di Bari Italy
Foster Provost New York University USA
Gabor Melli PredictionWorks USA
Galit Shmueli University of Maryland USA
Gary Weiss Fordham University USA
Gerhard Heyer University of Leipzig Germany
Gerhard Paaß Fraunhofer Germany
Data Mining for Business Applications, edited by C. Soares, and R. Ghani, IOS Press, Incorporated, 2010. ProQuest Ebook Central,
http://ebookcentral.proquest.com/lib/tromsoub-ebooks/detail.action?docID=647889.
Created from tromsoub-ebooks on 2024-08-11 18:34:23.
x

Graham Williams Australian Taxation Office Australia


University of Canberra
Grégory Grefenstette CEA France
Gregory Piatetsky-Shapiro KDNuggets USA
Guimei Liu National University of Singapore Singapore
Jakub Piskorski JRC European Commission Belgium
João Gama University of Porto Portugal
John Elder Elder Research USA
Joost Kok Leiden University The Netherlands
Jörg-Uwe Kietz Kdlabs AG Switzerland
Jorge Coelho SisConsult Portugal
José Luís Borges University of Porto Portugal
Katharina Probst Accenture Technology Labs USA
Ken Reed Lower My Bills USA
Khosrow Hassibi KXEN USA
Laura Squier KXEN USA
Leon Bottou NEC USA
Liu Zehua Circos.com Singapore
Longbing Cao University of Technology, Sydney Australia
Lubos Popelínský Masaryk University Czech Republic
Luc Dehaspe PharmaDM Belgium
Luís Moniz SAS Portugal Portugal
Luís Torgo University of Porto Portugal
Manuel Filipe Santos University of Minho Portugal
Mário Fernandes Portgás Portugal
Mário Silva University of Lisbon Portugal
Marko Grobelnik Jozef Stefan Institute Slovenia
Markus Ackermann University of Leipzig Germany
Massih Amini University of Paris 6 France
Mehmet Göker PricewaterhouseCoopers USA
Michael Berthold University of Konstanz Germany
Miguel Calejo Declarativa Portugal


Universidade do Minho
Min-Ling Zhang Hohai University China
Mykola Pechenizkiy University of Eindhoven Finland
Natasa Milic-Frayling Microsoft Research USA
Nick Dumas Brown Discover USA
Nitin Indurkhya University of New South Wales Australia
Paul Bradley Apollo Data Technologies USA
Paula Brito University of Porto Portugal
Paulo Cortez University of Minho Portugal
Paulo Oliveira WhiteBook Consulting Portugal
Paulo Quaresma University of Évora Portugal
Pavel Brazdil University of Porto Portugal
Pedro Barahona New University of Lisbon Portugal
Peter van der Putten Chordiant Software/ The Netherlands
Leiden University


Petr Berka University of Economics of Prague Czech Republic


Pieter Adriaans Robosail The Netherlands
Raul Domingos Vadis Belgium
Rayid Ghani Accenture Technology Labs USA
Reza Nakhaeizadeh University of Karlsruhe Germany
Rob Cooley KXEN USA
Robert Engels ESIS Norway
Robert Grossman Open Data Partners USA
University of Illinois at Chicago
Ronen Feldman Clearforest USA
Rüdiger Wirth DaimlerChrysler Germany
Rui Camacho University of Porto Portugal
Ruy Ramos University of Porto Portugal
Caixa Econômica do Brasil
Samy Bengio Google USA
Sascha Schulz Humboldt University Germany
Shengrui Wang University of Sherbrooke Canada
Stefan Wrobel Fraunhofer Institute Germany
Stefano Ferilli University of Bari Italy
Stephen Bay PricewaterhouseCoopers USA
Steve Gallant KXEN USA
Steve Moyle Secerno UK
Teresa Godinho Allianz Portugal
Thierry Artières University of Paris 6 France
Timm Euler University of Dortmund Germany
Tom Khabaza SPSS UK
Usama Fayyad Yahoo USA
Younes Bennani University of Paris 13 France

Contents
Preface v
Carlos Soares and Rayid Ghani
Members of the Program Committees of the DMBiz Workshops ix

Data Mining for Business Applications: Introduction 1


Carlos Soares and Rayid Ghani

Part 1. Data Mining Methodology

Interactivity Closes the Gap – Lessons Learned in an Automotive Industry


Application 17
Axel Blumenstock, Markus Mueller, Carsten Lanquillon, Steffen Kempe,
Jochen Hipp and Ruediger Wirth
Best Practices for Predictive Analytics in B2B Financial Services 35
Raul Domingos and Thierry Van de Merckt
Towards the Generic Framework for Utility Considerations in Data Mining
Research 49
Seppo Puuronen and Mykola Pechenizkiy
Customer Validation of Commercial Predictive Models 66
Tilmann Bruckhaus and William E. Guthrie

Part 2. Data Mining Applications of Today



Customer Churn Prediction – A Case Study in Retail Banking 77


Teemu Mutanen, Sami Nousiainen and Jussi Ahola
Resource-Bounded Outlier Detection Using Clustering Methods 84
Luis Torgo and Carlos Soares
An Integrated System to Support Electricity Tariff Contract Definition 99
Fátima Rodrigues, Vera Figueiredo and Zita Vale
Mining Medical Administrative Data – The PKB Suite 110
Aaron Ceglar, Richard Morrall and John F. Roddick

Part 3. Data Mining Applications of Tomorrow

Clustering of Adolescent Criminal Offenders Using Psychological and


Criminological Profiles 123
Markus Breitenbach, Tim Brennan, William Dieterich and Greg Grudic
Forecasting Online Auctions Using Dynamic Models 137
Wolfgang Jank and Galit Shmueli

A Technology Platform to Enable the Building of Corporate Radar Applications


that Mine the Web for Business Insight 149
Peter Z. Yeh and Alex Kass
Spatial Data Mining in Practice: Principles and Case Studies 164
Christine Körner, Dirk Hecker, Maike Krause-Traudes, Michael May,
Simon Scheider, Daniel Schulz, Hendrik Stange and Stefan Wrobel

Subject Index 179


Author Index 181

Data Mining for Business Applications 1
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-1

Data Mining for Business Applications:


Introduction
Carlos SOARES a,1 and Rayid GHANI b
a LIAAD-INESC Porto L.A./Faculdade de Economia, Universidade do Porto, Portugal
b Accenture Technology Labs, USA

Abstract. This chapter introduces the volume on Data Mining (DM) for Business
Applications. The chapters in this book provide an overview of some of the ma-
jor advances in the field, namely in terms of methodology and applications, both
traditional and emerging. In this introductory paper, we provide a context for the
rest of the book. The framework for discussing the contents of the book is the DM
methodology, which is suitable both to organize and relate the diverse contributions
of the chapters selected. The chapter closes with an overview of the chapters in the
book to guide the reader.

Keywords. Data mining applications, data mining process.

Preamble

As the importance of the knowledge-based economy increases, Data Mining (DM) is


becoming an integral part of businesses and governments. Many applications have been
incorporated into the information systems (IS) and business processes of companies in
a wide range of industries (e.g., Information technology (IT), IS, e-business, banking,
insurance, finance, retail, telecommunications, automotive, health and energy). Although
less publicized, DM is becoming equally important in Science and Engineering.2
Typical problems which DM is used for include marketing campaign management,
customer relationship management, product recommendation, churn prediction and fraud
detection. Successful applications of DM are leading companies to explore the use of
this technology to address other areas of their business, such as process planning and
management, human resources management, quality control, procurement and knowl-
edge management. Besides new problems and applications, interesting challenges and
opportunities are also being created by the development of other technologies, includ-
ing hardware, such as mobile devices and sensor networks, and software, such as social
networks.
As data mining becomes a mainstream technology in businesses, data mining re-
search has been experiencing explosive growth both in terms of interest as well as in-
vestment. In addition to well established application areas such as targeted marketing,
1 Corresponding Author: LIAAD-INESC Porto L.A./Universidade do Porto, Rua de Ceuta 118 6o andar;

E-mail: [email protected].
2 An overview of scientific and engineering applications is given in [1].


customer churn, and market basket analysis, we are witnessing a wide range of new ap-
plication areas, such as social media, social networks, and sensor networks. In addition,
more traditional industries and business processes, such as health care, manufacturing,
customer relationship management and marketing are also applying data mining tech-
nologies in new and interesting ways. These areas pose new challenges both in terms
of the nature of the data available (e.g., complex and dynamic data structures) as well
as in terms of the underlying supporting technology (e.g., low-resource devices). These
challenges can sometimes be tackled by adapting existing algorithms but at other times
need new classes of techniques.
A major reason behind the success of the data mining field has been the healthy
relationship between the research and the business worlds. This relationship is strong
in many companies where researchers and domain experts collaborate to solve practical
business problems. On the one hand, business problems are driving new research (e.g.,
the Netflix prize3 and DM competitions such as the KDD CUP4 ). On the other hand, re-
search advances are finding applicability in real world applications (e.g., support vector
machines in Computational Biology5 ). Many of the companies that integrate data min-
ing into their products and business processes also employ some of the best researchers
and practitioners in the field. Some of the most successful recent data mining companies
have also been started by distinguished researchers. Researchers in universities are get-
ting more connected with businesses and are getting exposed to business problems and
real data. Often, new breakthroughs in data mining research have been motivated by the
needs and constraints of practical business problems. Data Mining conferences, such as
KDD, ICDM, SDM, PKDD and PAKDD, play an important role in the interaction be-
tween researchers and practitioners. Companies are participating very actively in these
conferences, both by providing sponsorship as well as attendees.
This healthy relationship between academia and industry does not mean that there
are no issues left to be solved when building data mining solutions. From a purely tech-
nical perspective, plenty of algorithms, tools and knowledge is available to develop good
quality DM models. However, despite the amount of available information (e.g., books,
papers and web pages) about DM, some of the most practical aspects are not sufficiently
documented. These aspects include data preparation (e.g., cleaning and transformation),
adapting of existing methods to solve a new application, combination of different types
of methods (e.g., clustering and classification), incorporation of domain knowledge into
data mining systems, usability of data mining systems, ease of deployment, and testing
and integration of the DM solution with the Information System (IS) of the company.
Not only do these issues account for a large proportion of the effort spent in a DM project
but they often determine its success or failure [2].
A series of workshops have been organized to enable the presentation of work that
addresses some of these concerns.6 These workshops were organized together with some
of the most important DM conferences:
• “Data Mining for Business” workshop, with ECML/PKDD, organized by Car-
los Soares, Luís Moniz (SAS Portugal) and Catarina Duarte (SAS Portugal),

3 http://www.netflixprize.com
4 http://www.sigkdd.org/kddcup/index.php
5 http://www.support-vector.net/bioinformatics.html
6 http://www.liaad.up.pt/dmbiz


which took place in Porto, Portugal, in 2005 (http://www.liaad.up.pt/dmbiz/).
• “Data Mining for Business Applications” workshop, with KDD, organized by
Rayid Ghani and Carlos Soares, in Philadelphia, USA, in 2006 (http://labs.accenture.com/kdd2006_workshop/).
• “Practical Data Mining: Applications Experiences and Challenges” workshop,
with ECML/PKDD, organized by Markus Ackermann (Univ. of Leipzig), Carlos
Soares and Bettina Guidemann (SAS Deutschland), which took place in Berlin,
Germany, in 2006 (http://wortschatz.uni-leipzig.de/~macker/dmbiz06/).
• “Data Mining for Business Applications” workshop, with KDD, organized by
Rayid Ghani, Carlos Soares, Françoise Soulié-Fogelman (KXEN), Katharina
Probst (Accenture Technology Labs) and Patrick Gallinari (Univ. of Paris), that
took place in Las Vegas, USA, in 2008 (http://labs.accenture.com/kdd2008_workshop/).
This book contains extended versions of a selection of papers from those workshops. The
chapters of this book cover the whole range of issues involved in the development of DM
projects. This makes the book interesting for Data Mining researchers and practitioners
that are looking for new research and business opportunities in DM, as well as students
who wish to learn more about the practical issues involved in DM projects and find
opportunities for further research. This book complements a previous volume, which
includes selected papers from another workshop in the same series that was organized in
2007 together with PAKDD, in Nanjing, China7 [3].
In Section 1 we discuss some of the issues of the applications of DM that were
previously identified. An overview of the chapters of the book is given in Section 2.
Finally, we present some concluding remarks (Section 3).

1. Application Issues in Data Mining



Methodologies, such as CRISP-DM [4], typically organize DM projects into the follow-
ing six steps (Figure 1): business understanding, data understanding, data preparation,
modeling, evaluation and deployment. In the following, we briefly present how the chap-
ters in this book address relevant issues for each of those steps.

1.1. Business and Data Understanding

In the business understanding step, the goal is to clarify the business objectives for the
project. The second step, data understanding, consists of identifying sources, collecting
and becoming familiar with the data available for the project.
A very important issue is the scope of the project. It is necessary to identify a busi-
ness problem rather than a DM problem and develop a solution which combines DM ap-
proaches with others, where and whenever necessary. Some of the chapters in this book
illustrate this concern quite well. Rodrigues et al. address the problem of recommending
the most suitable tariff for electricity consumers [5]. The system proposed combines DM
7 http://www.liaad.up.pt/dmbiz/


Figure 1. The Data Mining Process, according to the CRISP-DM methodology (image obtained from
http://www.crisp-dm.org)

with more traditional decision support systems (DSS) technologies, such as a knowledge
base, a database and an inference engine. In another chapter, Ceglar et al. describe a tool
to support patient costing in health units [6]. The authors wrap several DM models in a
tool for users who are not DM experts. In the corporate radar proposed by Yeh and Kass
the goal is to automatically monitor the web and select news stories that are relevant to a
business [7]. Technologies from diverse areas are combined, including text mining, nat-
ural language processing (NLP), semantic models, inference engines and web sensors.
The tool offers a set of dashboards to present the selected information to the business
users.
To view the project in context, it is necessary to identify the stakeholders. This is
discussed by Puuronen and Pechenizkiy in the context of DM research [8]. The needs
of those stakeholders, in particular the end users, must be understood. Several chapters
discuss this issue in different contexts, including Blumenstock et al. for the automobile
industry [9], Domingos and van de Merckt for the financial services industry [10] and
Bruckhaus and Guthrie for the semiconductor industry [11].
It is also essential to clearly define the goals of the project. This should be done in
terms of the business as well as of the data mining methods. Domingos and van de Mer-
ckt establish the profitability of customers as their business goal, while a typical accuracy
measure is used to assess the algorithms [10]. Whenever possible, goals should be quan-
tified. In the error detection application described by Torgo and Soares, the customer de-
fined thresholds for maximum effort and minimum results, which should be respected in
order for the project to be considered successful [12]. Sometimes it is necessary for the


project team to clarify the definition of concepts which could, at first sight, be considered
trivial. This is the case of the concept of churn in the churn prediction task tackled by
Mutanen et al. for a retail bank [13].
In some cases, there are some constraints associated with the process that affect the
DM effort and, thus, should be identified as soon as possible. One example is the limit in
the amount of resources available for follow-up activities in the application described by
Torgo and Soares, which changes over time [12].
Understanding the data and their sources is an increasingly important step due to the
growing volume, diversity and complexity of data. In their chapter, Jank and Shmueli
propose a system to process the huge amounts of complex data from online eBay auctions [14].
news from diverse sources on the web, which are then combined with semantic models
of the application [7]. The chapter by Körner et al. describes several applications that
combine spatial, time and socio-demographic data, which is quickly becoming a common
scenario [15].
Finally, an issue that is very rarely addressed is the assessment of the costs of the
DM activity. A few examples, such as the cost of gathering data and the costs of
errors, are discussed by Puuronen and Pechenizkiy [8].

1.2. Data Preparation

Data preparation consists of a diverse set of operations to clean and transform the data
so as to prepare it for the following modeling step [16].
The chapters by Blumenstock et al. and Rodrigues et al. illustrate some of the most
common kinds of problems that data may contain: imbalanced classes, inconsistent data,
outliers and missing values [9,5]. Some problems can be addressed with generic methods,
which do not depend on the application (e.g., replacing missing values with the mean
value of the attribute). In other cases, the correction uses domain knowledge (e.g., replac-
ing missing values with a special value). Blumenstock et al. find that simple operations
such as discretization can be very important to produce models for users who are not
data miners [9].
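
To make these generic operations concrete, the following Python sketch shows mean imputation of a missing attribute and equal-width discretization of a continuous one; the column names, data and number of bins are illustrative assumptions made for this example, not taken from any of the cited chapters.

```python
# A minimal sketch of two generic data preparation steps: replacing missing
# values with the attribute mean and discretizing a continuous attribute.
# The column names ("income", "age") and the data are illustrative only.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer

df = pd.DataFrame({"income": [1200.0, None, 3400.0, 2800.0, None],
                   "age": [23, 45, 31, 52, 38]})

# Generic correction: fill missing values with the column mean.
imputer = SimpleImputer(strategy="mean")
df["income"] = imputer.fit_transform(df[["income"]]).ravel()

# Discretization: map "age" into three equal-width intervals, which often
# makes the resulting models easier to read for users who are not data miners.
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
df["age_bin"] = disc.fit_transform(df[["age"]]).ravel().astype(int)

print(df)
```
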


Some data preparation operations cannot be done without domain knowledge. This
is the case of feature engineering (e.g., combining some of the original attributes into a
more informative one). In domains with highly complex data, such as the applications
described by Körner et al., this is particularly important [15]. To generate variables that
contain the information needed to achieve the project goals, a large investment in this
operation is often necessary. Domingos and van de Merckt present a thorough analysis
of the problem domain, which led to the design of the predictor variables [10]. Other
operations which benefit from the use of domain knowledge are the variable selection
and manual segmentation of the data, such as described in the chapters by Torgo and
Soares [12] and Rodrigues et al. [5].
Special attention should be given to the creation of the target variable, to make sure
that it represents the concept of interest for the business. For instance, Domingos and van
de Merckt need a clear understanding of the concept of lead profitability to define the
corresponding target variable [10]. A typical problem with the target variable is class im-
balance. This problem occurs in the bank customer churn detection problem of Mutanen
et al. [13] and also in the error detection application of Torgo and Soares [12].
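
As a minimal illustration of this step, the sketch below derives a churn target from a hypothetical transaction table (churn defined here as no activity after a cutoff date) and then inspects the resulting class distribution; the rule, column names and dates are assumptions made for the example, not the definitions used in the cited chapters.

```python
# Sketch: deriving a churn target variable from raw activity data and
# checking class balance. The churn rule (no transactions after a cutoff
# date) and the column names are hypothetical.
import pandas as pd

transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3, 4],
    "date": pd.to_datetime(["2009-01-10", "2009-08-02", "2009-02-15",
                            "2009-03-05", "2009-09-20", "2009-11-30",
                            "2009-04-01"]),
})

cutoff = pd.Timestamp("2009-07-01")
last_seen = transactions.groupby("customer_id")["date"].max()

# Label as churned (1) every customer with no activity after the cutoff.
target = (last_seen < cutoff).astype(int).rename("churn")

# The class distribution typically turns out to be imbalanced, which has
# to be taken into account in the modeling and evaluation steps.
print(target.value_counts(normalize=True))
```
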


An essential tool in data understanding and preparation is visualization. Plotting the


data in many different ways allows the data analyst to identify problems and also to ob-
tain important information that can be used for feature engineering and even modeling.
This is particularly true when the applications have spatial data, such as the ones de-
scribed by Körner et al. [15] but also for applications in other areas such as the health
sector, as discussed by Ceglar et al. [6].

1.3. Modeling

In the modeling step, the data resulting from the application of the previous steps is
analyzed to extract the knowledge that will be used to address the business problem.
In some applications, domain-dependent knowledge is integrated into the DM pro-
cess in all steps except this one, in which off-the-shelf methods/tools are applied. In this
volume, Mutanen et al. used logistic regression to identify bank customers that are likely
to churn [13]. Besides having been often reported as obtaining good results, this algo-
rithm is also used often because it generates models that can be understood by many end
users.
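
The following sketch illustrates this kind of off-the-shelf modeling step: a logistic regression model is fitted with scikit-learn on synthetic customer attributes and its coefficients are printed, since their signs and magnitudes are what end users typically inspect. The features and data are invented for illustration and do not reproduce the cited case study.

```python
# Sketch: fitting an off-the-shelf logistic regression model for churn
# prediction and inspecting its coefficients, which is one reason the
# algorithm is popular with business users. Features are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(1, 60, n),          # months as a customer
    rng.poisson(3, n),               # products held
    rng.normal(1000, 300, n),        # average monthly balance
])
# Synthetic target: shorter tenure and fewer products raise churn odds.
logits = -0.05 * X[:, 0] - 0.4 * X[:, 1] + 1.5
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_std = StandardScaler().fit_transform(X)
model = LogisticRegression(class_weight="balanced").fit(X_std, y)

for name, coef in zip(["tenure", "products", "balance"], model.coef_[0]):
    print(f"{name:>9}: {coef:+.2f}")
```
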
Sometimes, the results obtained with a single method are not satisfactory and better
solutions can be obtained with a combination of multiple methods. For instance, the sys-
tem proposed by Rodrigues et al. for electricity tariff recommendation includes cluster-
ing and classification modules [5]. In both of them, they use common algorithms, namely
k-means and Self-Organizing Maps (SOM) for clustering and decision trees for classifi-
cation. Another example is the system described by Domingos and van de Merckt, which
combines sequences of methods (including data preparation and modeling methods) to
develop a large number of models [10]. The methods are selected based on best practices
according to the experience of the authors. The issue of dealing with a very large num-
ber of models is becoming increasingly popular in DM, leading to what has been called
extreme data mining [17].
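
A rough sketch of such a combination is given below: k-means groups synthetic consumption profiles into segments, and a small decision tree then learns to assign consumers to those segments from two simple attributes. This only illustrates the general clustering-plus-classification pattern and is not the system described in the cited chapter.

```python
# Sketch of combining clustering and classification: k-means discovers
# consumption profiles, and a decision tree learns to assign customers to
# those profiles from simpler attributes. All data here is synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
# 200 consumers, hourly consumption profile over 24 hours.
profiles = np.vstack([
    rng.normal(loc=np.sin(np.linspace(0, np.pi, 24)) * peak, scale=0.2,
               size=(100, 24))
    for peak in (1.0, 3.0)          # two underlying consumption levels
])

# Step 1: cluster the load profiles into consumer groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = kmeans.fit_predict(profiles)

# Step 2: learn a readable model that predicts the group from two simple
# attributes (here, total and peak consumption).
features = np.column_stack([profiles.sum(axis=1), profiles.max(axis=1)])
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(features, groups)

print(export_text(tree, feature_names=["total_kwh", "peak_kwh"]))
```
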
A different modeling approach consists of developing/adapting general methods for
a specific application, taking into account its peculiarities. In some cases, the applica-
tion is not really new but has specific characteristics that require existing methods to be
adapted. In the chapter by Torgo and Soares, a hierarchical clustering algorithm is used
[12]. The use of clustering algorithms for outlier detection is not new. However, due to
the nature of the application, the algorithm was changed such that a ranking of the obser-
vations is generated, rather than a simple selection of the ones which are potential errors.
This makes it possible for the domain expert to use the output of the method in different
ways depending on the available resources.
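
A minimal sketch of the general idea, not of the authors' exact method, is shown below: each observation is scored by the distance at which it first merges with any other observation in an agglomerative clustering, and the observations are then ranked by that score so that inspection can proceed from the most suspicious cases downwards and stop whenever resources run out.

```python
# Sketch: turning hierarchical clustering into an outlier *ranking* rather
# than a yes/no selection. Each point is scored by the merge distance of
# its first merge in the dendrogram; isolated points merge late and rank
# first. This illustrates the general idea only, not the cited method.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),      # bulk of the data
               np.array([[8.0, 8.0], [-7.0, 6.0]])  # two likely errors
               ])

Z = linkage(X, method="average")     # agglomerative clustering

n = X.shape[0]
scores = np.zeros(n)
for obs in range(n):
    # Row of the linkage matrix where this observation is first merged.
    first_merge = np.where((Z[:, 0] == obs) | (Z[:, 1] == obs))[0][0]
    scores[obs] = Z[first_merge, 2]

ranking = np.argsort(-scores)        # most suspicious observations first
print("top candidates for inspection:", ranking[:5])
```

Producing a ranking rather than a fixed selection is what lets the available inspection budget, which may change over time, determine how far down the list to go.
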
Some applications involve novel tasks that require the development of new methods,
sometimes incorporating important amounts of domain knowledge. Some chapters in
this book describe new methods motivated by the underlying application, for the health
industry [6], criminology [18] and price prediction in eBay auctions [14]. In the chapter
by Körner et al. the methods are customized to deal with spatial data [15]. The com-
plexity of the data is such that the methods described by the authors incorporate prepro-
cessing mechanisms together with the model building ones. In the applications for the
automotive industry described by Blumenstock et al., the requirements for interactivity
are so strong that new algorithms are proposed that can incorporate decisions made by
the users during the building of the models, which are called interactive algorithms [9].


A data analyst must also be prepared to use methods for different DM tasks and orig-
inating from different fields, as they may be necessary in different applications, some-
times combined as previously described. The applications described in this book illus-
trate this quite well, including some tasks which are not so common in DM applications.
They include clustering (e.g., [5,6,18]), classification (e.g., [9,10,13,5]), regression (e.g.,
[15]), quantile regression (e.g., [10]), outlier detection (e.g., [12,6]), subgroup discov-
ery (e.g., [9,15]), time series analysis (e.g., [6,14]), visual analytics (e.g., [6,15]) and
information retrieval and extraction (e.g., [7]).
Additionally, as previously stated, to build complete solutions for business prob-
lems, it is often necessary to combine DM techniques with others. These can be tech-
niques from artificial intelligence or related fields. In the application by Rodrigues et al.
the DM models are integrated into a decision support system (DSS) that also incorpo-
rates a knowledge base, a database and an inference engine [5]. In the corporate radar
proposed by Yeh and Kass, very diverse technologies are used, including text mining,
natural language processing (NLP), semantic models, inference engines and web sensors
[7]. In some cases, the solution may combine DM with more traditional techniques. In
the targeting application for the financial services industry described by Domingos and
van de Merckt, a manual segmentation of customers is carried out before applying the
DM methods [10]. In summary, the wider the range of tools that is mastered by a data
analyst (or the team working on the project), the better the results that can be obtained.
Some of the papers in this volume also discuss the importance of tools. Domingos
and van de Merckt observe that most of the DM tools available on the market are work-
benches that end up being too flexible [10]. This leaves the developer with many tech-
nical decisions to be made during the modeling phase when the focus should be on the
business issues. The tool developed by Ceglar et al. for patient costing addresses this
issue [6]. On the one hand, it is tuned for that particular application. On the other hand,
it leaves some room for the user to explore different methods. Blumenstock et al. pro-
vide a different perspective, arguing that the tools should be interactive during the model
building phase, to enable the domain expert to contribute with his/her knowledge during
the process [9].

1.4. Evaluation

The goal of the evaluation step is to assess the adequacy of the knowledge obtained
according to the project objectives.
For a DM project to be successful, the criteria selected to evaluate the knowledge
obtained in the modeling phase must be aligned with the business goals. In some cases,
it is possible to find measures that the experts can relate to. A few examples can be found
in this book, with lift [9,13] and recall [9]. Torgo and Soares present an unusual case,
where the experts established goals in terms of two measures that are common in DM:
recall and percentage of selected transactions [12].
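
To make such measures concrete, the sketch below computes recall and lift for a model that can only flag the top 10% highest-scored cases, mirroring the kind of "percentage of selected transactions" constraint mentioned above; the scores, labels and the 10% budget are synthetic examples chosen for illustration.

```python
# Sketch: computing recall and lift when only the top-scored fraction of
# cases can be inspected. Scores, labels and the 10% budget are synthetic.
import numpy as np

rng = np.random.default_rng(3)
n = 1000
y_true = (rng.random(n) < 0.05).astype(int)          # about 5% positives
# Imperfect scores: positives tend to get higher scores.
scores = rng.random(n) + 0.7 * y_true

budget = 0.10                                        # inspect top 10%
k = int(budget * n)
selected = np.argsort(-scores)[:k]

tp = y_true[selected].sum()
recall = tp / y_true.sum()                           # positives recovered
precision_at_k = tp / k
lift = precision_at_k / y_true.mean()                # vs. random selection

print(f"recall at {budget:.0%} selected: {recall:.2f}")
print(f"lift at {budget:.0%} selected:   {lift:.1f}x")
```
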
Very often, however, the measures that are commonly used to assess DM models do
not represent the business goals well. The chapter by Bruckhaus and Guthrie discusses
how evaluation can be made by customers, not only at the model level but at the individ-
ual decision level [11]. The authors give a few examples of how business goals can be
translated to technical DM goals. As discussed in some chapters in this book, visualiza-
tion tools are also very useful to present DM results to domain experts in a way that is
easy for them to understand (e.g., [6,18,15]).


Furthermore, most of the time, the evaluation of a DM system is not based on a single
criterion but rather on multiple criteria. Criteria of interest usually include technical measures
as well as business-related ones (e.g., the cost of making an incorrect prediction). The
chapter by Puuronen and Pechenizkiy describes a framework that allows researchers to
take these considerations into account [8].
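
As a toy illustration of such a cost-sensitive, multicriteria view (a sketch of the general idea only, not the framework proposed in the cited chapter), the snippet below combines a technical measure with business-related error and data costs into a single net-value figure; all cost values are invented.

```python
# Toy sketch of a cost-sensitive, multicriteria evaluation: a model is
# judged not only by a technical measure but by the business costs of its
# errors and of acquiring the data it needs. All figures are invented.

def utility(tp, fp, fn, tn,
            benefit_tp=50.0, cost_fp=5.0, cost_fn=100.0, data_cost=2000.0):
    """Accuracy and net value of deploying a model with this confusion matrix."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)       # technical measure
    net_value = benefit_tp * tp - cost_fp * fp - cost_fn * fn - data_cost
    return accuracy, net_value

# Model A is less accurate than model B but misses fewer of the costly
# positives, so its net business value turns out to be higher.
for name, cm in [("model A", (90, 60, 10, 840)),
                 ("model B", (60, 5, 40, 895))]:
    acc, value = utility(*cm)
    print(f"{name}: accuracy={acc:.3f}, net value={value:,.0f}")
```
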
In many situations, the users not only require a model that achieves the goals of the
business in terms of a suitable measure (or measures) but they also need to understand
the knowledge represented by that model. In this case, the data must describe concepts
that are familiar to the users (e.g., [10]) represented in a way that they understand (e.g.,
by discretizing continuous attributes [9,5]). Additionally, the algorithm must generate
models in a language that is also understandable by the users (e.g., decision trees in
the automotive industry [9] and logistic regression in the financial industry [13]). The
interactive algorithms proposed by Blumenstock et al. also contribute to evaluation in an
interesting way [9]. Given that the users interactively participate in the building of the
model, they are, thus, committing to the quality of the result.
Other tools that can be helpful for the evaluation are simulation (e.g., [5]), compari-
son with an existing theory (e.g., [18]) or with the decisions made by humans (e.g., [7])
and the use of satisfaction surveys (e.g., [7]).

1.5. Deployment

Deployment is the step in which the solution developed in the project, after being prop-
erly tested, is integrated into the (decision-making) processes of the organization.
Despite being critical for the success of a DM project, this step is often not given
sufficient importance, in contrast to other steps such as business understanding and data
preparation. This attitude is illustrated quite well in the CRISP-DM guide [4]:
In many cases it is the customer, and not the data analyst, who carries out the deployment
steps. However, even if the analyst will not carry out the deployment effort it is important for
the customer to understand up front what actions need to be carried out in order to actually
make use of the created models.

This graceful handing over of responsibilities of the deployment step by the data analyst
can be the cause for the failure of a DM project which, up to this step, has obtained
promising results. This is confirmed in some of the chapters in this book, which clearly
state that the development of adequate tools is important [9,6].
A very important issue when the tool is to be used by domain experts who are not
data miners is the interface. The system must be easy to use and present the results clearly
and using a language that is familiar to the user [9,5,7,15].
Another important aspect arises from the changes in conditions when moving from
the development to the operational setting. DM projects are often developed with samples
of data and without concerns for learning and response times. However, when deploying
a DM solution, its scalability and efficiency should be considered carefully [14].
Given that the goal is to incorporate the result of the DM project into the company’s
business processes, it is usually necessary to integrate it with the information system
(IS). This can be done at different levels. On one end, it may be completely independent,
with data being moved from the IS to the system on a periodic basis. At the other end, we
have solutions that are tightly integrated into the IS, possibly requiring the system to be
reengineered. Most of the applications in this volume follow an intermediate approach,


where the DM solution is developed as a separate solution with its own user interface
and integration into the IS being achieved by the sharing of the database [9,5,6].

2. Overview

The chapters are organized into three groups. In Part 1, we present chapters that discuss
methodological issues. The chapters in Part 2 describe case studies in some of the com-
mon areas of application of DM. Finally, Part 3 contains chapters that address some in-
novative applications of DM technology. In the following sections we give an overview
of their content.

2.1. Part 1: DM Methodology

This part starts with a reflection by Blumenstock et al. on the extensive DM experience
at Daimler [9]. This company has been involved in the research and application of DM
technologies for over 20 years. This chapter uses a case study on warranty data analysis to
discuss some of the lessons learned during that period. They argue for focusing on the needs
of the users and claim that the two most important principles to achieve this are simplicity
and interactivity. In this work, they take this principle to an extreme. Besides integrating
the experts into the business understanding phase, the data understanding and preparation
phases, and the evaluation phase, they also use their domain knowledge explicitly in the
modeling phase: they propose interactive algorithms where the expert is asked to make
decisions during the model building process.
As the number of DM projects increases, together with the number of models de-
veloped for each task, companies are becoming interested in defining best practices to
reduce the effort without reducing the quality of the models. This problem is addressed
in the second chapter, focusing on the financial services sector [10]. The case study dis-
cussed is on the identification of sales leads in a B2B context. Prospective customers
are segmented and different models are generated for different segments. The authors
use a tool that embodies some of the best practices they developed. These best practices
support several of the phases of the DM process, such as data preparation, algorithm
selection, parameter tuning and evaluation.
Chapter 3 also addresses the problem of taking the requirements of users into ac-
count but the focus is not on DM projects [8]. The authors go back further to DM re-
search. They observe that in research, data are a resource for which benefits have been
widely publicized by the DM community but whose costs have been mostly ignored.
They argue that this problem can be addressed by taking a cost/benefit perspective in the
evaluation of DM, and propose a multicriteria, utility-based framework for that purpose.
This framework is flexible enough to be useful for users with different roles. Some of
the discussion in this chapter is based on an interesting parallel with the evolution of the
Information Systems (IS) field.
The last chapter in this part is an interesting contribution to the discussion on the
need to align general DM evaluation criteria with domain-specific criteria [2], by Bruck-
haus and Guthrie [11]. The authors argue that domain experts should be involved in the
evaluation of both models and individual predictions. The importance of using the lan-
guage of the domain both in the representation of the models and in the communication


between data miners and the domain experts is stressed. The ideas are discussed in the
context of a case study in the semiconductor industry.

2.2. Part 2: DM Applications of Today

The second part starts with a case study in one of the most popular application areas of
DM, Customer Relationship Management (CRM), by Mutanen et al. [13]. In this particular
case, the chapter addresses a churn prediction problem in a Finnish retail bank. The authors
use logistic regression, an algorithm that is often used for this type of problem. A
lot of attention is dedicated both to data preparation and to model analysis and evaluation,
which is essential for a successful DM project in this domain. One particularly important
issue in churn prediction is the suitable definition of the churn variable. This variable is
frequently defined in such a way that, when churn is predicted, the customer is already
lost.
The case study described by Torgo and Soares in Chapter 7 is a good example of an
application in which the constraints of the application are deeply integrated in the DM
project [12]. The problem tackled is the detection of errors in data that are used by the
Portuguese institute of statistics to calculate foreign trade statistics. The constraints that
must be verified affect not only data preparation and evaluation but also the algorithm
itself. The authors propose an error detection method based on hierarchical clus-
tering that is adapted to take into account that the resources available for inspection are
limited and change throughout time.
The application presented in Chapter 8 by Rodrigues et al. is on the energy industry,
an area with a growing number of opportunities for DM [5]. The chapter addresses the
problem of recommending the most appropriate tariff for energy consumers based on
their consumption profile. This chapter illustrates the integration of different DM meth-
ods with other techniques to build a decision support system (DSS) that will be used by
domain experts who are not knowledgeable about DM.
The application described in Chapter 9 by Ceglar et al. [6] is in healthcare, another
domain of major importance for DM. The authors describe a DM tool designed specifically to
improve quality of care and resource management in hospitals. Although it is specific
to this domain and targeted at users who are not professional data miners, it gives the
users some freedom to explore the implemented algorithms. Being oriented towards non-
data-mining experts, it emphasizes simple communication with the users and has a strong
focus on visualization and model description techniques. This is also a good example of
an application that motivates new research. The tool implements data mining methods
that were developed to address some of the specificities of the domain. It is also a good
illustration of the collaboration between universities and companies, and the authors also
discuss some of the lessons learned.

2.3. Part 3: DM Applications of Tomorrow

Chapter 10, by Breitenbach et al., addresses an emerging application area for DM,
criminology [18]. This is essentially a descriptive DM application whose purpose is to
identify criminal profiles. The specificities of the application, namely the need to obtain a
reliable description of profiles, led to the development of a new clustering method. The
results also illustrate a very important contribution that DM techniques can make. Some
of the profiles identified by the method from the data contradict existing theories in
criminology research. This led to the need to investigate their validity further and opened
up a potentially novel perspective on criminal profiles.
The second application in this part, by Jank and Shmueli, is in another promising
domain, social networks [14]. The problem tackled is the prediction of prices in online
auctions on eBay. The goal is to help users focus on the potentially most advantageous
auctions among the large number that may involve goods of interest to them.
This is a very challenging problem due to the complex and highly dynamic nature of
the data. A new method is proposed in which techniques from very diverse fields are
combined, including functional data analysis and econometrics, to make real-time fore-
casts. Besides a complex methodology, the solution incorporates a significant amount
of domain knowledge. The authors point out that to build a practical system incorporat-
ing this approach, two very important issues must be addressed, namely scalability and
efficiency.
The system described by Yeh and Kass in Chapter 12 addresses an essential problem
that companies face in today’s extremely competitive environment [7]: how to filter and
use external information that is relevant for their business. There are too many potential
sources of relevant information for a company to monitor all of them. The authors de-
scribe a tool to continuously monitor the web in search of important information for a
specific goal. The tool presents the information to users using dashboards for simplicity.
This is another example of the integration of different technologies, including text and
web mining, natural language processing and inference engines. These technologies are
combined with domain knowledge to build complex models of the context of a company.
Two prototype applications are described. One of them detects information that is used
to assess the maturity of emerging technologies. This can be used
by middle management to support decisions concerning which technology to invest in.
The second extracts business insights about the market from the web, such as threats and
opportunities (e.g., competitors’ use of new materials that enables a reduction in production
costs), which can be used for strategic decision making. The usability of the
tool is essential because it is used by managers who have little or no knowledge of data
mining. Therefore, the authors complemented a quantitative evaluation of the accuracy
of the information provided with a satisfaction survey. The result was encouraging, with
an overwhelming majority of users indicating their intention to continue using the system.
The final chapter reports on a series of case studies by Körner et al. using spatial
data [15]. Geo-referenced data are becoming increasingly common and present an enor-
mous opportunity for richer and more useful DM models. The applications described
in this chapter are essentially in marketing and planning, and use a wide range of tech-
niques, including forecasting, visualization, and subgroup discovery. Due to the complex
nature of spatial data, preprocessing is even more important than is usual in DM, partic-
ularly the task of feature extraction, to obtain variables that encode the necessary infor-
mation. Despite the potential in spatial data, the authors observe that there are still few
DM methods that are customized to take advantage of it as well as to deal with some of
the issues it raises (e.g., volume). They identify a few methods that have that ability, in-
cluding a subgroup discovery algorithm that incorporates a feature selection mechanism.
This is particularly important in spatial data mining because a large number of features
can typically be generated. One of the challenges of spatial DM that is illustrated in this
chapter is the combination of spatial data (collected from digital maps, GPS or even sur-
veys) with time and socio-demographic data. This raises challenging issues which turn
this into one of the more interesting research areas in data mining.

3. Conclusions

As data mining becomes a mainstream technology in businesses, data mining research
has experienced unprecedented activity. New problems and application domains, together
with new technologies, are giving rise to exciting research challenges. This evolution is
enabled by a healthy relationship between academia and industry, which have been
collaborating with each other to advance the science and practice of data mining.
This book contains extended versions of papers that were presented at a series of
workshops on Data Mining for Business Applications. They discuss both classical and
emerging applications, as well as new methods which arise from some of the new chal-
lenges the field is facing, such as social networks and spatial data.
In spite of the maturity of the field, the documentation of some of the most practical
aspects of DM projects is still scarce (e.g., data preparation, adapting general methods to
specific applications and deployment of DM models). Selecting contributions that help to
address this problem was one of our major concerns in this volume. The book clearly shows that DM
projects must not be regarded as independent efforts but should rather be integrated
into broader projects that are aligned with the organization’s goals. In most cases, the
output of the DM component is a solution that must be integrated into the organization’s
information systems and, therefore, into its (decision-making) processes.
We believe that this book may be interesting not only for Data Mining researchers
and practitioners who are looking for new research and business opportunities in DM, but
also for students who wish to have an idea of the practical issues involved in DM projects
and find opportunities for further research.

Acknowledgments

We would like to start by acknowledging Pavel Brazdil, who suggested to Carlos S. the
organization of the first workshop, held together with ECML/PKDD.
We are also indebted to the colleagues who helped us organize the workshops which
have been the basis for this book: Luís Moniz, Catarina Duarte, Markus Ackermann,
Bettina Guidemann, Françoise Soulié-Fogelman, Katharina Probst and Patrick Gallinari.
We are also thankful to the members of the Program Committee for their timely and
thorough reviews, despite receiving more papers than promised, and for their comments,
which we believe were very useful to the authors.
We also wish to thank the organizers of the conferences that have hosted the
workshops: ECML/PKDD 2005,8 KDD 2006,9 ECML/PKDD 200610 and KDD 2008.11
We are very thankful to everybody who helped us to publicize the workshops, particularly
Gregory Piatetsky-Shapiro (www.kdnuggets.com), Guo-Zheng Li (MLChina Mailing List in
China) and KMining (www.kmining.com). We are also thankful to Rita Pacheco, from
INESC Porto LA, for revising some of the chapters.
8 http://www.liaad.up.pt/~ecmlpkdd05/
9 http://www.sigkdd.org/kdd2006/
10 http://www.ecmlpkdd2006.org/
11 http://www.sigkdd.org/kdd2008/
The support of several institutions is also gratefully acknowledged:
SAS,12 SPSS,13 KXEN,14 Accenture15 and LIAAD-INESC Porto LA.16
The first author wishes to acknowledge the financial support of the Faculdade de Econo-
mia do Porto and the following projects: Triana (POCTI/TRA/61001/2004/Triana),
Site-O-Matic (POSI/EIA-58367-2004), Oranki (PTDC/EIA/68322/2006) and Rank!
(PTDC/EIA/81178/2006), funded by the Fundação para a Ciência e a Tecnologia (FCT) and
co-financed by FEDER.
12 http://www.sas.com/
13 http://www.spss.com/
14 http://www.kxen.com/
15 http://www.accenture.com/
16 http://www.liaad.up.pt/

References

[1] Robert L. Grossman, Chandrika Kamath, Philip Kegelmeyer, Vipin Kumar, and Raju R. Namburu. Data
Mining for Scientific and Engineering Applications. Kluwer Academic Publishers, Norwell, MA, USA,
2001.
[2] R. Kohavi and F. Provost. Applications of data mining to electronic commerce. Data Mining and
Knowledge Discovery, 6:5–10, 2001.
[3] Carlos Soares, Yonghong Peng, Jun Meng, Takashi Washio, and Zhi-Hua Zhou, editors. Applications
of Data Mining in E-Business and Finance, volume 177. IOS Press, Amsterdam, The Netherlands, 2008.
[4] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth. CRISP-DM 1.0:
Step-by-Step Data Mining Guide. SPSS, 2000.
[5] Fátima Rodrigues, Vera Figueiredo, and Zita Vale. An integrated system to support electricity tariff
contract definition. In Carlos Soares and Rayid Ghani, editors, Data Mining for Business Applications,
Frontiers in Artificial Intelligence and Applications, chapter 8. IOS Press, 2010.
[6] Aaron Ceglar, Richard Morrall, and John F. Roddick. Mining medical administrative data – the PKB
suite. In Carlos Soares and Rayid Ghani, editors, Data Mining for Business Applications, Frontiers in
Artificial Intelligence and Applications, chapter 9. IOS Press, 2010.
[7] Peter Z. Yeh and Alex Kass. A technology platform to enable the building of corporate radar applications
that mine the web for business insight. In Carlos Soares and Rayid Ghani, editors, Data Mining for
Business Applications, Frontiers in Artificial Intelligence and Applications, chapter 12. IOS Press, 2010.
[8] Seppo Puuronen and Mykola Pechenizkiy. Towards the generic framework for utility considerations in
data mining research. In Carlos Soares and Rayid Ghani, editors, Data Mining for Business Applications,
Frontiers in Artificial Intelligence and Applications, chapter 4. IOS Press, 2010.
[9] Axel Blumenstock, Markus Mueller, Carsten Lanquillon, Steffen Kempe, Jochen Hipp, and Ruediger
Wirth. Interactivity closes the gap. In Carlos Soares and Rayid Ghani, editors, Data Mining for Business
Applications, Frontiers in Artificial Intelligence and Applications, chapter 2. IOS Press, 2010.
[10] Raul Domingos and Thierry van de Merckt. Best practices for predictive analytics in B2B financial
services. In Carlos Soares and Rayid Ghani, editors, Data Mining for Business Applications, Frontiers
in Artificial Intelligence and Applications, chapter 3. IOS Press, 2010.
[11] Tilmann Bruckhaus and William Guthrie. Customer validation of commercial predictive models. In
Carlos Soares and Rayid Ghani, editors, Data Mining for Business Applications, Frontiers in Artificial
Intelligence and Applications, chapter 5. IOS Press, 2010.
[12] Luís Torgo and Carlos Soares. Resource-bounded outlier detection using clustering methods. In Carlos
Soares and Rayid Ghani, editors, Data Mining for Business Applications, Frontiers in Artificial Intelli-
gence and Applications, chapter 7. IOS Press, 2010.
[13] Teemu Mutanen, Sami Nousiainen, and Jussi Ahola. Customer churn prediction - a case study in retail
banking. In Carlos Soares and Rayid Ghani, editors, Data Mining for Business Applications, Frontiers
in Artificial Intelligence and Applications, chapter 6. IOS Press, 2010.
[14] Wolfgang Jank and Galit Shmueli. Forecasting online auctions using dynamic models. In Carlos Soares
and Rayid Ghani, editors, Data Mining for Business Applications, Frontiers in Artificial Intelligence and
Applications, chapter 11. IOS Press, 2010.
[15] Christine Körner, Dirk Hecker, Maike Krause-Traudes, Michael May, Simon Scheider, Daniel Schulz,
Hendrik Stange, and Stefan Wrobel. Spatial data mining in practice: Principles and case studies. In
Carlos Soares and Rayid Ghani, editors, Data Mining for Business Applications, Frontiers in Artificial
Intelligence and Applications, chapter 13. IOS Press, 2010.
[16] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
[17] Françoise Soulié-Fogelman. Data mining in the real world: What do we need and what do we have? In
R. Ghani and C. Soares, editors, Proceedings of the Workshop on Data Mining for Business Applications,
pages 44–48, 2006.
[18] Markus Breitenbach, Tim Brennan, William Dieterich, and Greg Grudic. Clustering of adolescent crim-
inal offenders using psychological and criminological profiles. In Carlos Soares and Rayid Ghani, edi-
tors, Data Mining for Business Applications, Frontiers in Artificial Intelligence and Applications, chap-
ter 10. IOS Press, 2010.
Part 1
Data Mining Methodology
Data Mining for Business Applications 17
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-17

Interactivity Closes the Gap


Lessons Learned in an Automotive Industry Application

Axel BLUMENSTOCK a,1, Markus MUELLER b, Carsten LANQUILLON c,
Steffen KEMPE a, Jochen HIPP a, and Ruediger WIRTH a
a Quality Analysis, Daimler AG, Germany
b University of Bamberg, Germany
c Heilbronn University, Germany
1 Contact author: [email protected]

Abstract. After nearly two decades of data mining research there are many com-
mercial mining tools available, and a wide range of algorithms can be found in
literature. One might think there is a solution to most of the problems practition-
ers face. In our application of descriptive induction on warranty data, however, we
found a considerable gap between many standard solutions and our practical needs.
Confronted with challenging data and requirements such as understandability and
support of existing work flows, we tried many things that did not work, ending up
with simple solutions that do. We feel that the problems we faced are not so uncom-
mon, and would like to advocate that it is better to focus on simplicity—allowing
domain experts to bring in their knowledge—rather than on complex algorithms.
Interactivity and simplicity turn out to be key features to success.

Keywords. Subgroup Discovery, Interactivity, Human Computer Interaction

1. Introduction

An air bellow bursts: This happens on one truck, on another it does not. Is this random
coincidence, or the result of some systematic weakness? Questions like these have always
kept experts busy at Daimler’s After Sales Services. Detecting and fixing unex-
pected quality issues as early as possible is key to a continuous improvement of Daim-
ler’s top quality products and to ensure customer satisfaction.
This primary goal of quality enhancement entails several tasks to be solved:
• predicting upcoming quality issues as early as possible,
• explaining why some kind of quality issue occurs and feeding this information
back to engineering,
• isolating groups of vehicles that might be affected by a certain defect in future, so
as to make service actions more targeted and effective.
When working on these tasks, quality engineers get valuable insights from analyzing
warranty and production data. Systems for early warning, quality reporting, warranty
cost control, and root cause investigations build upon a quality warehouse which inte-
grates heterogeneous data sources like warranty claims, manufacturing-related data, or
diagnostics information.
Our research group picks up common data mining methods and adapts them to the
practical needs of our engineers and domain experts. This contribution reports on the
lessons learned when designing a system that supports root cause investigations and the
planning of effective service actions. In particular, we elaborate on our experience that
the right answer to domain complexity need not be algorithmic complexity—but rather
simplicity. Simplicity opens ways to create an interactive setup which involves experts
without overwhelming them. And if truly involved, an expert will understand and accept
the results and turn them into action.
In the following section, we will outline the problem setting and discuss the specific
application requirements. In Section 3 we present lessons-learned when applying stan-
dard data mining methods to our domain. Apart from what we tried and did not work, we
introduce interactive decision trees and interactive rule cubes that have gained notable
user acceptance and have proven real practical value. The subsequent section explains
the key features of our interactive data analysis tool. In a real world case study we learn
how this system can be applied in practice.

2. Domain and Requirements

2.1. The Domain

Our users are experts in the field of vehicle engineering, specialized on various subdo-
mains such as engine or electrical equipment. They keep track of what is going on in the
field, mainly by analyzing warranty data, and try to discover upcoming quality issues as
early as possible. As soon as they recognize a problem, they strive for finding out the
root cause in order to address it most accurately.
They have been doing these investigations successfully for years. Now, data mining
can help them to better meet the demands of fast reaction, well-founded insight and
targeted service. But any analysis support must fit into the users’ mindset, their language,
and their work flow.
The structure of the problems to be analyzed varies substantially. This task requires
inspection, exploration and understanding for every case anew. Ideally, the engineers
should be enabled to apply various exploration and analysis methods from a rich repos-
itory. And it is important that they do it themselves, because no one else could decide
quickly enough whether a certain clue is relevant and should be pursued, nor ask the
proper questions. Explaining strange phenomena requires both comprehensive and de-
tailed background knowledge.
Yet, the engineers are not data mining experts. They could make use of data mining
tools out of the box, but common data mining suites already require deeper understanding
of the methods. Further, the users are reluctant to accept any system-generated hypothesis
if the system cannot give exact details that justify this hypothesis. The bottom line is
that comprehensibility and, again, interactivity are almost indispensable features of any
mining system in our field.

2.2. The Data

Most of the data at hand is warranty data, providing information about diagnostics and
repairs at the dealerships. Further data is about vehicle production, configuration and
usage. All these sources are heterogeneous, and the data is not collected for the purpose
of causal analyses. This raises questions about reliability, appropriateness of scale, and
level of detail. Apart from these concerns, our data has some additional properties that
make it hard to analyze, including

Imbalanced classes: The class of interest, made up of all instances for which a certain
problem was reported, is very small compared to its contrast set. Often, the pro-
portion is far below 1 %.
Multiple causes: Sometimes, a single kind of problem report can be traced back to dif-
ferent causes that produced the same phenomenon. Therefore, on the entire data
set, even truly explanatory rules show only modest qualities in terms of statistical
measures.
Semi-labeledness: The counterpart of the positives is not truly negative. If there is a
warranty entry for some vehicle, it is (almost) sure that it indeed suffered the prob-
lem reported on. For any non-positive example, however, it is unclear whether it
carries problematic properties and may fail in near future.
High-dimensional space of influence variables: There are thousands of variables, each be-
ing potentially relevant only for a specific subset of quality issues. Although the
feature space can be reduced tremendously by automatic and interactive feature
selection, many variables have to be considered artifacts that are irrelevant for a
specific analysis.
Influence variables interact strongly: Some quality issues do not occur until several
influences coincide. And, if an influence exists in the data, many other non-causal
variables follow, showing positive statistical dependence with the class as well.
True cause not in data: It is very unlikely that the actual cause to a quality issue is
among the configuration or production related variables that make up the data. Of-
ten, the engineer has to deduce abstract concepts like usage scenarios from vehicle
properties related to this concept.

These properties make it all but impossible for a truly causal influence to be found by an
automated process. Yet, these problems can be addressed by allowing users to bring in
their background knowledge. As long as this knowledge is very much case-specific and
thus difficult to formalize, there seems to be hardly any alternative to a setup of interactive
model-building.

2.3. The Data Mining Task

Let us first have a theoretical look at the problem. We consider vehicles non-conforming
if they are more likely than others to be affected by a specific quality issue. For any vehicle
we would like to be able to tell whether it is likely to encounter problems in the future.
However, the model should not only be predictive, but primarily help the engineer in
understanding and explaining the quality issue. The notion of causality plays a key role
in this process. Our goal is to come as close to the root cause of a quality issue as possible
by identifying the characteristics of non-conforming vehicles (Figure 1). Hence, data
mining should not only help to reveal potentially useful influences the engineer would
have never thought of, but should also help to narrow down the root cause by eliminating
non-causal findings and thus rejecting hypotheses.

Figure 1. The data mining task is to separate non-conforming vehicles that are likely to be affected by a specific
quality issue (×) from all other vehicles () by identifying the main characteristics of the non-conforming
ones. In the example, the fraction of non-conforming vehicles among the Classic edition is higher than among
Avantgarde vehicles.

Table 1. Example dataset: 2 700 in 60 000 vehicles are non-conforming vehicles with a DTC set.

Cruise Control   Edition      No. Vehicles with DTC    Total
Yes              Avantgarde                     200   10 000
No               Avantgarde                     400   20 000
Yes              Classic                      1 400   20 000
No               Classic                        700   10 000
Total                                         2 700   60 000

Table 2. The sales code Cruise Control seems to be related to the failure when looking at the whole dataset
(Table 2(a)). If one analyzes the subsets of Avantgarde and Classic vehicles separately, Cruise Control no
longer has an effect on the failure (Table 2(b)).

(a) Combined Dataset
Cruise Control   Fault Rate (Avantgarde, Classic)
Yes              5.3 %
No               3.7 %

(b) Separate Datasets
Cruise Control   Fault Rate (Avantgarde)   Fault Rate (Classic)
Yes              2.0 %                     7.0 %
No               2.0 %                     7.0 %

To motivate the importance of causality, let us have a look at the following illustrat-
ing example (Table 1). Assume that 2 700 in 60 000 vehicles are non-conforming vehicles
that are brought to dealerships because a lamp indicates a diagnostic trouble code (DTC).
Now the engineer compares the fault rates for vehicles equipped with cruise control to
those without (Table 2(a)). A binomial test would indicate a significant deviation of the
target share in the subgroup of vehicles with cruise control. However, if the engineer
calculates fault rates for Classic and Avantgarde vehicles separately, cruise control
no longer seems to be related to the issue (Table 2(b)). The variable cruise control
is conditionally independent of the class variable given the state of variable edition. The
primary influence is edition, and cruise control only seems to be important because more Clas-
sic edition vehicles are equipped with cruise control than Avantgarde vehicles. What is
worse is that the true cause is probably hidden behind the influence edition. Hence, our
goal is to come as close to the true cause as possible by suppressing findings that are
likely to be non-causal.
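To make the arithmetic of this example explicit, the following minimal Python sketch (our own illustration, not part of any tool described in this chapter) recomputes the combined and the stratified fault rates from the counts in Table 1:

    # Counts from Table 1: (cruise_control, edition) -> (vehicles with DTC, total vehicles)
    counts = {
        ("yes", "Avantgarde"): (200, 10_000),
        ("no",  "Avantgarde"): (400, 20_000),
        ("yes", "Classic"):    (1_400, 20_000),
        ("no",  "Classic"):    (700, 10_000),
    }

    def fault_rate(cells):
        faults = sum(f for f, n in cells)
        total = sum(n for f, n in cells)
        return faults / total

    # Table 2(a): fault rate by cruise control over the combined data set
    for cc in ("yes", "no"):
        rate = fault_rate([v for (c, _), v in counts.items() if c == cc])
        print(f"CruiseControl={cc}: {rate:.1%}")                        # 5.3% vs. 3.7%

    # Table 2(b): the same comparison within each edition separately
    for ed in ("Avantgarde", "Classic"):
        for cc in ("yes", "no"):
            rate = fault_rate([counts[(cc, ed)]])
            print(f"Edition={ed}, CruiseControl={cc}: {rate:.1%}")      # 2.0%/2.0% and 7.0%/7.0%

Stratifying by edition makes the apparent effect of cruise control vanish, which is exactly the conditional independence discussed above.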

2.3.1. What we tried


A great portion of the task could be seen as a classification problem. As stated above,
however, data is semi-labeled, and the problem behind the positive class may have mul-
tiple causes. These properties act as if there was a strong inherent noise that changes the
class variable in either direction. Classifier induction tries to separate the classes in the
best possible way, but can return unpredictable, arbitrary results when noise increases.
For our application, it suffices to grab the most explainable part of the positives and leave
the rest for later investigation or, finally, ascribe it to randomness. In other words, we
experienced that anything beyond partial description is not adequate here (confer Hand’s
categorization into mining the complete relation versus local patterns [1]).
So we came up with subgroup discovery (e.g., [2]). It means to identify subsets of
the entire object set which show some unusual distribution with respect to a property
of interest—in our case, the binary class variable. Results from subgroup discovery ap-
proaches need not be restricted to knowledge acquisition, but can be re-used for picking
out objects of interest. This is the partial classification we want, where a statement about
the contrast set is not adequate or required.
Still, the data properties described make subgroup discovery results unusable most
of the time. There are many candidate influences, and they interact strongly. Therefore,
even if the cause could be described by a single variable, it would be hard to find it among
the set of variables influenced by it. All these variables, including the causal one, would
refer to roughly the same subset of vehicles with an increased proportion of positives.

2.3.2. What works


Rather than mere discovery, we focus on subgroup description. This means to identify
the very same subgroups, but in a way as comprehensive and informative as possible.
The rationale is that, in a domain with thousands of strongly interrelated influence variables,
there may be many descriptions that are statistically similar, but only some of them give
the crucial hint to the problem cause. How should a system distinguish them? Subgroup
description is thus required to provide any reasonable explanation as long as there is no
evidence that the finding is void or unjustified.
In short, rather than finding the best-fitting model or composing some statistically
optimized representation, the new challenge is to provide guidance to a user who inter-
actively explores a huge space of hypotheses. In the course of such analyses, he continually
formulates his own hypotheses and wants the system to tell him what supports or
contradicts his assumptions, and what is worth considering beyond.

3. Data Mining Methods

3.1. Interactive Decision Trees

Subgroup discovery (and description) can be mapped to partitioning the instance set
into multiple decision tree leaves. Paths to leaves with a high share of the positive class
provide a description of an interesting subgroup. In fact, decision tree induction roughly
corresponds to what our experts had been doing even before getting in touch with data
mining. Hence, decision trees were the first method we chose.

3.1.1. What we tried


To quickly provide the users with explanation models, it was natural to build deci-
sion trees automatically as is typically done when inducing tree-based classifiers [3,4,5].
However, the experts considered the results unusable most of the time, because the split
attributes that had been selected by any of the common top-down tree induction algo-
rithms were often uninformative or meaningless to them: The top-ranked variable was
rarely the most relevant one, but of course it is the one picked by automatic tree induction.
For some time, we experimented with different measures. The literature suggests mea-
sures such as information gain, information gain ratio, χ2 p-value, or the Gini index, to men-
tion the most important ones.
However, in one exemplary analysis case, there was a variable that gave the actual
hint for the expert to discover the quality issue’s cause. This variable was ranked 27th by
information gain, 41st by gain ratio, 36th by p-value and 33rd by the Gini index. We conclude
that an automatic induction process is not suitable in our application setting.

3.1.2. What works


This is where interactivity comes into action. Building trees interactively relieves the
measure of choice from the burden of selecting the single “best” split attribute. The idea
is almost trivial: Present the attributes in an ordered list and let the expert make tentative
choices until he finds one he considers plausible.
What remains is the problem of how to rank the attributes in a reasonable way, i.e. so
that attributes which are more likely to be interesting are ranked higher. But even for this
ranking, the aforementioned statistical measures proved little helpful. We explain this by
the fact that they are measures designed for classifier induction, trying to separate the
classes in the best possible way. But as illustrated in Section 2.3, this is not the primary
goal in our application. Hence, for interactive subgroup discovery and description, we
Copyright © 2010. IOS Press, Incorporated. All rights reserved.

require that the measure be able to identify interesting subgroups, i.e. single nodes of the
decision tree, and that it be comprehensible to the business users.
Most of the time, we deal with two-class problems: the positive class C = 1 versus
the contrasting rest C = 0 with the positive class attracting our attention as the interesting
concept. Hence, we may use the measure lift (the factor by which the positive class rate
P (C = 1 | A = a) in a given node A = a is higher than the positive class rate in the
root node P (C = 1 | ∅)):
lift(A = a → C = 1) = P(C = 1 | A = a) / P(C = 1)
To complement the lift value of a tree node, we use the recall (the fraction covered) of
the positive class:

recall(A = a → C = 1) = P (A = a | C = 1)

Both lift and recall are readily understandable for the business users as they have imme-
diate analogies in their domain. Furthermore, note that as we are considering only the lift
and recall values of the most interesting node rather than an average value over all nodes
resulting from a split as is done in general tree induction for classification tasks, we are
now able to focus on interesting subgroups.

Figure 2. Quality space for the assessment of split attributes. Each dot represents an attribute, plotted over
recall (x axis) and lift (y axis) of the best (possibly clustered) child that would result. Dots are plotted bold if
there is no other dot that is better in both dimensions. The curves are isometrics according to the recall-weighted
lift (wtLift).
By focusing on high-lift paths, the users can successively split tree nodes to reach a
lift as high as possible while maintaining nodes with substantial recall. Note that simply
choosing nodes with maximal lift does not suffice as this often results in nodes with only
very few instances which are obviously not helpful in explaining root causes despite their
high lift. Therefore, we always have to consider both lift and recall in combination.
In order to condense lift and recall values into a suitable attribute ranking, we have
derived a one-dimensional measure which we refer to as weighted lift or “explanational
power”:

wtLift(A = a → C = 1) = P(A = a | C = 1) · (1 − 1 / lift(A = a → C = 1))

This weighted lift is not intended to add to the tens of statistical quality measures that
already exist. Many other measures will do. The point is that while we need a scalar
measure to provide the attribute ranking, we always present the lift and recall values for
each attribute to the users, too, as these are the measures they understand.
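As an illustration of how the three measures fit together, the following short Python sketch (a hedged example of ours using the counts of Table 1, not the production code) computes lift, recall and the weighted lift for a candidate node:

    def lift(pos_node, n_node, pos_root, n_root):
        """Factor by which the positive class rate in the node exceeds the root rate."""
        return (pos_node / n_node) / (pos_root / n_root)

    def recall(pos_node, pos_root):
        """Fraction of all positives that the node covers."""
        return pos_node / pos_root

    def wt_lift(pos_node, n_node, pos_root, n_root):
        """Recall-weighted lift ('explanational power') of a node."""
        return recall(pos_node, pos_root) * (1 - 1 / lift(pos_node, n_node, pos_root, n_root))

    POS_ROOT, N_ROOT = 2_700, 60_000   # root node of Table 1

    # Best child of a split on Edition vs. best child of a split on CruiseControl
    print(wt_lift(2_100, 30_000, POS_ROOT, N_ROOT))   # Edition=Classic    -> approx. 0.28
    print(wt_lift(1_600, 30_000, POS_ROOT, N_ROOT))   # CruiseControl=yes  -> approx. 0.09

With the counts of Table 1, the best child of an Edition split thus outranks the best child of a CruiseControl split, which is in line with Table 3 and the split previews shown later in Figure 4.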
As an alternative to a ranked list, the user can still get the more natural two-
dimensional presentation of the split attributes (Figure 2). Similar to plotting a ROC
space, every such attribute is drawn as a point while using recall and lift as the two
dimensions.
Using the weighted lift to rank attributes works well for symbolic attributes with
a small number of possible values. For attributes with large domains and for numeric
attributes we require some further processing to suit our interactive setting.
Let us consider symbolic attributes with large domains. There are two pitfalls in our
application domain. First, our measure is more sensitive with regard to skewed distri-
butions of attribute values, as we consider only single nodes rather than averages. Second,
as we focus on interesting paths within the tree, any further nodes resulting from a split
only distract attention and impair understandability.
To mitigate these problems, we group attribute values (or, the current node’s chil-
dren). We require the resulting split to create at most k children, where typically k = 2 so
as to force binary splits. This ensures both that the split is “handy” and easily understood
by the user, and that the subsequent attribute ranking can be based consistently on the
child node with the highest lift.
C4.5 [5], for example, offers some grouping facility that merges pairs of children as
long as the gain ratio does not degrade. However, this algorithm of squared complexity is
neither necessary nor reasonable in our two-class world with a focus on high-lift paths,
the more so as in interactive use low response times are desirable.
To group the children in a reasonable and efficient way without any structural infor-
mation on the attribute domain, we automatically proceed as follows. Initially we sim-
ply sort the nodes resulting from a split by their lift values. Then, keeping their linear
order, we cluster them using several heuristics: first, to increase robustness of the ap-
proach, merge smallest nodes with their nearest neighbor with regard to lift values. Then
we continue in agglomerative clustering manner: merge adjacent nodes with lowest lift
difference until the desired number of nodes is reached.
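A possible reading of this grouping procedure is sketched below in Python; it is our interpretation of the heuristics just described, not the original implementation, and the 5 % minimum size is an assumed threshold:

    def group_children(children, k=2, min_frac=0.05):
        """Cluster child nodes into at most k groups while keeping their lift order.

        children: list of (positives, total) pairs, one per attribute value.
        min_frac: assumed threshold; clusters smaller than this fraction of all
                  instances are merged first to increase robustness.
        """
        total_all = sum(n for _, n in children)

        def rate(cluster):   # positive rate of a cluster, proportional to its lift
            return sum(p for p, _ in cluster) / sum(n for _, n in cluster)

        def frac(cluster):   # relative size of a cluster
            return sum(n for _, n in cluster) / total_all

        def merge(i):        # merge clusters i and i+1, keeping the linear order
            clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]

        # one cluster per child, sorted by positive rate (i.e. by lift)
        clusters = sorted(([c] for c in children), key=rate)

        # robustness step: merge the smallest cluster with its lift-nearest neighbour
        while len(clusters) > k and min(map(frac, clusters)) < min_frac:
            i = min(range(len(clusters)), key=lambda j: frac(clusters[j]))
            if i == 0:
                merge(0)
            elif i == len(clusters) - 1:
                merge(i - 1)
            else:
                left_gap = abs(rate(clusters[i - 1]) - rate(clusters[i]))
                right_gap = abs(rate(clusters[i + 1]) - rate(clusters[i]))
                merge(i - 1 if left_gap <= right_gap else i)

        # agglomerative step: merge adjacent clusters with the lowest lift difference
        while len(clusters) > k:
            i = min(range(len(clusters) - 1),
                    key=lambda j: rate(clusters[j + 1]) - rate(clusters[j]))
            merge(i)
        return clusters

Because merging two adjacent clusters always yields a rate between those of its parts, the linear lift order is preserved throughout.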
Although the grouping of attribute values is performed automatically during attribute
assessment, the users may undo and redo the grouping interactively. They may even
arrange the attribute values into any form that they desire. This is important to further
incorporate background knowledge, e.g. with respect to ordered domains, geographical
regions, or, in particular, components that are used in certain subsets of vehicles and
should, thus, be considered together.

Figure 3. Example of a numeric split chart for the attribute BuildDate. The height of the bars indicates the
subgroup size, i.e. the number of vehicles produced in a specific month. The color of the bars encodes the lift
on a scale from green to red (here: grayscale).

Now, consider numeric attributes. In order to be used in a decision tree, the numeric
domain has to be discretized. For our interactive setting it is most adequate, for the rea-
son of understandability, to provide either binary or ternary splits. All we have to do is
adapt the method of dealing with numeric attributes deployed by any one of the com-
mon tree induction algorithms so that split point quality is assessed based on our mea-
sures with respect to the single most interesting node instead of the standard measures
averaging over all nodes of a split.
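A hedged sketch of this adaptation for a binary split on a single numeric attribute is given below (an in-memory illustration of ours, not the actual implementation; the function and parameter names are assumptions):

    def best_binary_split(values, labels, thresholds=None):
        """Pick the threshold whose *best* child has the highest weighted lift.

        values: numeric attribute values; labels: 1 for non-conforming vehicles, else 0.
        Standard tree induction would average an impurity measure over both children;
        here only the most interesting child counts.
        """
        pos_root, n_root = sum(labels), len(labels)
        if thresholds is None:
            thresholds = sorted(set(values))[:-1]   # all observed values but the largest

        def wt_lift(pos, n):
            if n == 0 or pos == 0:
                return 0.0
            lift = (pos / n) / (pos_root / n_root)
            return (pos / pos_root) * (1 - 1 / lift)

        best_score, best_t = float("-inf"), None
        for t in thresholds:
            left = [y for v, y in zip(values, labels) if v <= t]
            right = [y for v, y in zip(values, labels) if v > t]
            score = max(wt_lift(sum(left), len(left)), wt_lift(sum(right), len(right)))
            if score > best_score:
                best_score, best_t = score, t
        return best_t, best_score   # threshold and weighted lift of its best child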
To enhance interactivity, the users may modify a resulting split by interactively
choosing their own split points. Again, this allows further incorporation of background
knowledge, which proved to be especially useful for date attributes, e.g. with respect to
known changes in product development or known clean points in the production pro-
cess. Supporting split point selection by means of interactive diagrams which visualize
the relevant lift and recall values depending on the numeric attribute under consideration
(Figure 3) is of great value for the users. In addition, it is very helpful to immediately
show a preview of the resulting sub-tree while adjusting the split points.
Two split previews for the example data set in Table 1 are depicted in Figure 4.
Weighted lift ranks the variable Edition higher than the variable CruiseControl.

(a) Decision Tree A (split on Edition)
    Root: fault rate 4.5 %, issue 2 700, sum 60 000
    Edition=Avantgarde: fault rate 2 %, issue 600, sum 30 000
    Edition=Classic: fault rate 7 %, issue 2 100, sum 30 000

(b) Decision Tree B (split on CruiseControl)
    Root: fault rate 4.5 %, issue 2 700, sum 60 000
    CruiseControl=no: fault rate 3.7 %, issue 1 100, sum 30 000
    CruiseControl=yes: fault rate 5.3 %, issue 1 600, sum 30 000

Figure 4. Two possible splits for the example data set in Table 1. The decision tree in Figure 4(a) is superior
to the split in Figure 4(b) with regard to lift and recall. The color of the node header encodes the lift.

3.1.3. Causality
Interactivity plays the key role in our approach and it is important that a model does
not maximize a statistically motivated scoring function, but the expert’s degree of belief
in the correctness of a hypothesis. Hence, an engineer could be tempted to pick cruise
control instead of edition for some reason in the example above. By looking ahead one
level as done in Figure 5, one can detect non-causality. Interactive look-ahead decision
trees based on the application of Bayesian partition models as described in [6] consider
causality in the attribute ranking. Moreover, taxonomies and partonomic structures are
Copyright © 2010. IOS Press, Incorporated. All rights reserved.

exploited to make the model more accurate. Normalized mutual information is used to
measure and visualize attribute similarity during split attribute selection.

3.2. Interactive Search in Complete Pattern Space

We mentioned the observation that in real life, influences sometimes interact in a way
that a quality issue does not occur until several influences coincide. While decision tree
building is intuitive, its search is greedy and thus may miss such interesting combina-
tions. So the experts asked for automatic, more comprehensive search. While data is an in-
herently fragmentary picture of reality, and no complete enumeration of patterns is guar-
anteed to find relevant influences, the user can be assured that he gets the most probably
useful patterns at least within the specified data range and search depth.
A second determinative observation is that the quality issue that defines the class
variable often traces back to several independent sub-phenomena, and that a lot of quasi-
noise exists. A typical model generation regime that tries to fit the entire data in the best
possible way will easily be misled on such data. Building only partial models seems a
way out.

Root: fault rate 4.5 %, issue 2 700, sum 60 000
    Edition=Avantgarde: fault rate 2 %, issue 600, sum 30 000
        CruiseControl=no: fault rate 2 %, issue 400, sum 20 000
        CruiseControl=yes: fault rate 2 %, issue 200, sum 10 000
    Edition=Classic: fault rate 7 %, issue 2 100, sum 30 000
        CruiseControl=no: fault rate 7 %, issue 700, sum 10 000
        CruiseControl=yes: fault rate 7 %, issue 1 400, sum 20 000

Figure 5. The variable CruiseControl is conditionally independent of the target variable given the variable
Edition.

Table 3. Rule set for the example data in Table 1. The consequence is fixed and thus omitted. Similar rules,
i.e. rules that cover similar instances, or potentially non-causal rules can be highlighted in our application.
Subgroup Coverage Recall Fault Rate wtLift
Edition=Classic 50% 78% 7% 0.28
Edition=Avantgarde 50% 22% 2% −0.28
CruiseControl=yes 50% 59% 5.3% 0.09
CruiseControl=no 50% 41% 3.7% −0.10
Edition=Classic ∧ CruiseControl=yes 33% 52% 7% 0.19
Edition=Classic ∧ CruiseControl=no 17% 26% 7% 0.09
Edition=Avantgarde ∧ CruiseControl=yes 17% 7% 2% −0.09
Edition=Avantgarde ∧ CruiseControl=no 33% 15% 2% −0.19
3.2.1. What we tried


These two observations first led us to rule-based subgroup discovery. Among others (e.g.,
[7,8]), a well-known subgroup discovery algorithm is CN2-SD [9]. It induces rules by
sequential covering: By heuristic search, it finds a single rule that is best according to
some statistical measure. Then it reduces the weights of the covered examples, and re-
iterates until no more rule of sufficient quality can be found.
The first handicap of this procedure is its strong dependence on some quality mea-
sure (the same as for noninteractive decision trees above): There is no measure that could
guarantee selection of the best next model element, here a rule.
But even the whole rule set that is subsequently mined that way lacks this descriptive
completeness the experts ask for when they face the task of causal investigation: Imagine
there are two rules describing roughly the same example set. CN2-SD will never find
both, because by modifying the weights of the examples covered, the ranks of both rules
change simultaneously. Obviously, this runs counter to the idea of subgroup description—
see Section 2.3.2 for details.

The next logical step was making search exhaustive (within constraints). It is real-
ized by an association rule miner with fixed consequence. For the data set in Table 1, Ta-
ble 3 shows such an exhaustive rule set. Like us, many have pursued this idea, many have
come across the problem of the sheer mass of (even significant) rules, and many research
groups thus investigated how to handle redundancy within the results (e.g., [10,11]).
These algorithms suppress patterns that are syntactically or statistically similar to others
that remain. That way, however, they re-introduce the problem we wanted to solve
by exhaustiveness: accidentally, meaningless patterns may suppress those that are truly
causal or at least provide the crucial hint.
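To make concrete what such an exhaustive, fixed-consequence search amounts to, the following sketch enumerates all conjunctions over the two variables of the toy example and recomputes the figures of Table 3 (our illustration only; a real miner would add support-based pruning and handle thousands of variables):

    from itertools import combinations, product

    # (cruise_control, edition) -> (positives, total), as in Table 1
    counts = {("yes", "Avantgarde"): (200, 10_000), ("no", "Avantgarde"): (400, 20_000),
              ("yes", "Classic"): (1_400, 20_000),  ("no", "Classic"): (700, 10_000)}
    domains = {"CruiseControl": ("yes", "no"), "Edition": ("Avantgarde", "Classic")}
    POS = sum(p for p, _ in counts.values())
    N = sum(n for _, n in counts.values())

    def stats(condition):
        """Coverage, recall, fault rate and weighted lift of a conjunction of selectors."""
        keys = [(cc, ed) for cc, ed in counts
                if condition.get("CruiseControl", cc) == cc and condition.get("Edition", ed) == ed]
        pos = sum(counts[k][0] for k in keys)
        n = sum(counts[k][1] for k in keys)
        lift = (pos / n) / (POS / N)
        return n / N, pos / POS, pos / n, (pos / POS) * (1 - 1 / lift)

    rules = []
    for depth in (1, 2):                                   # bounded search depth
        for attrs in combinations(domains, depth):
            for values in product(*(domains[a] for a in attrs)):
                cond = dict(zip(attrs, values))
                rules.append((cond, stats(cond)))

    for cond, (cov, rec, rate, wtl) in sorted(rules, key=lambda r: -r[1][3]):
        print(cond, f"coverage={cov:.0%}  recall={rec:.0%}  fault rate={rate:.1%}  wtLift={wtl:.2f}")

Running this reproduces the eight rows of Table 3, ranked by weighted lift.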
To solve the goal conflict of desirable exhaustiveness versus a prohibitive mass of
patterns, interactivity once again was a significant step forward. We started off with pre-
senting a ranked list of association rules to the expert and then enabling him to control
a CN2-SD-like sequential covering regime. Ranking allows him to find among the first
dozens of rules at least one which he recognizes as “interesting” or “already known”. He
picks it, modifies the instance set so as to remove this influence, and re-iterates to find
the next interesting rule. In contrast to the automatic regime, it is he who decides what
is removed, and the danger that he misses something is much smaller.
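Schematically, this expert-controlled covering loop can be summarized as follows; rank_rules, ask_expert and covered_by are hypothetical placeholders for the tool's ranking component, the expert's decision and the coverage test, so the sketch only illustrates the control flow:

    def interactive_covering(instances, rank_rules, ask_expert, covered_by, max_rounds=20):
        """Expert-controlled sequential covering (schematic sketch).

        rank_rules(instances) -> candidate rules, best first (e.g. by weighted lift)
        ask_expert(rules)     -> the rule the expert picks as interesting/known, or None
        covered_by(rule, instances) -> the instances covered by the rule
        """
        accepted = []
        for _ in range(max_rounds):
            candidates = rank_rules(instances)
            choice = ask_expert(candidates)
            if choice is None:                      # nothing interesting left
                break
            accepted.append(choice)
            covered = covered_by(choice, instances)
            # unlike CN2-SD, it is the expert, not a measure, who decides what is removed
            instances = [x for x in instances if x not in covered]
        return accepted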
While this basic idea of ranking patterns like search engine results and re-ranking
them based on interactively chosen rules was simple and appealing, we learned that in-
teractivity alone is not the magic word: We also had to find modes of visualization and
feedback that ease the flow of information needed to make interactivity effective.
Though often considered understandable, association rules show deficiencies to this
end. Even for only two variables, as in Table 3, you can hardly tell what impact the indi-
vidual items (selectors) have within a pattern, or what it would be like if you exchanged
one selector for another. In other words, a singular rule provides too fragmented a piece
of knowledge to be reasonably understood and then selected for the interactive ex-
ploration scheme described above.

3.2.2. What works


In rule cubes we found a type of pattern that overcomes these understandability concerns
[12]. Rule cubes are named after the cubic structure that results from arraying those as-
sociation rules that belong to the same set of variables. They can equivalently be con-
sidered contingency tables. A pattern now is a set of one or more variables, stating that
these variables have some impact on the class variable.
The most obvious advantage over association rules is that rule cubes allow for quite
intuitive visualizations. Figure 6 shows an example, with lots of variants being possible.
That it is intuitive may be ascribed to the fact that it does not display abstract patterns
but rather how the data is distributed under the pattern, which is a very common mode of thinking
in statistics. Presenting entire attribute domains instead of singular values, these visual-
izations answer the aforementioned questions about neighboring rules and the roles of
the individual variables.
The idea of rule cubes turned out to be quite beneficial for the other aspects of our
interactive setup as well, namely ranking and feedback. Having the complete distribution
available, it is easier to implement a fairer ranking of joint influences by only consider-
ing their additional value over the superposition of their components. (This technique is
known from ANOVA or log-linear analysis.) Feeding back information that some influ-
ence is known and should be removed can now be implemented in a better way than by

CN2-SD-like downweighting of covered instances: namely by a relative interestingness
measure based on lift-covering, which is an adaptation from [13]. See [14] for details.
Besides this basic re-ranking feedback, this relative interestingness measure allows
for even more interactive exploration: an interactive validation of influences. By intro-
ducing suppression strength as a ratio of cube interestingness and relative cube interest-
ingness, one can analyze the cube neighborhood and find out which influences are simi-
lar to a given one, or whether it might have been pushed by mere interaction with a more
plausible one (Figure 12).

                     CruiseControl=yes (5.3 %)   CruiseControl=no (3.7 %)   All
Edition=Classic                          7 %                        7 %     7 %
Edition=Avantgarde                       2 %                        2 %     2 %

Figure 6. Rule cubes provide an intuitive visualization for a two-dimensional contingency table. The size of
the tiles shows the number of covered instances while the color encodes the fault rate (lift) on a color scale.
Calculating the suppression strength reveals that CruiseControl is pushed by variable Edition.

Figure 6 shows the rule cube for the example data set from Table 1. The fact that
CruiseControl is irrelevant in the light of knowing the variable Edition (i.e., the former is
“suppressed by” the latter) becomes immediately apparent when looking at the conjoint
distribution.
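The conjoint distribution shown in the cube can be reproduced with a few lines of pandas (an illustration of ours; the suppression-strength measure itself is defined in [13,14] and is not reproduced here):

    import pandas as pd

    # Cell counts from Table 1
    cells = pd.DataFrame(
        [("yes", "Avantgarde", 200, 10_000), ("no", "Avantgarde", 400, 20_000),
         ("yes", "Classic", 1_400, 20_000),  ("no", "Classic", 700, 10_000)],
        columns=["CruiseControl", "Edition", "faults", "vehicles"])

    overall_rate = cells["faults"].sum() / cells["vehicles"].sum()     # 4.5 %

    # Two-dimensional rule cube: fault rate (and lift) per tile
    cube = cells.pivot(index="Edition", columns="CruiseControl")
    fault_rate = cube["faults"] / cube["vehicles"]
    print(fault_rate)                  # rows are constant: 2 % / 2 % and 7 % / 7 %
    print(fault_rate / overall_rate)   # lift per tile; it varies with Edition only

The per-tile lifts vary only with Edition, which is the visual cue that CruiseControl adds nothing once Edition is known.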

4. A Tool that Suits the Experts

4.1. What we tried


We had a look at several commercially available data mining suites and tools. However,
none of these met the requirements outlined in Section 2.1.
As an overall observation, they were rather inaccessible and often did not allow for
interaction at the model building level. Even if they did, they could not present informa-
tion (like measures) in the non-statistician users’ language. Tools of this kind offer their
methods in a very generic fashion so that the typical domain expert does not know where
to start. In short, we believe that the goal conflict between flexibility and guidance can hardly be solved by any general-purpose application, in which the greatest simplification potential, namely domain adaptation, remains unexploited.

4.2. What works

To meet these domain adaptation and simplification requirements, we finally decided to


develop our own data mining tool. Figure 7 shows a simplified view of our tool’s process
model. It emerged as the union of our experts' workflows and thus offers guidance even
for users not overly literate in data mining. At the same time, it does not constrain the


[Figure 7 shows the process skeleton as boxes: Prepare Data, Explore, Build Tree Model and Build Cube Model, with "complexity upon request".]

Figure 7. Coarse usage model of our tool. There is a fixed process skeleton corresponding to the original workflow. The user can just go through it, or gain more flexibility (and complexity) upon request.

user to a single process but allows going deeper and gaining flexibility wherever the user is able and willing to.
Usually, the users start with extracting data for further analysis. We tried to keep this
step simple and hide the complexities as much as possible. The user just selects the vehicle subset and the influence variables he wants to work with. A metadata-based system takes care of joins, aggregations, discretizations and other data transformation steps. This
kind of preprocessing is domain specific, but still flexible enough to adapt to changes and
extensions.
In the course of their analyses, the experts often want to derive variables of their own.
That way, they can materialize concepts otherwise spread over several other conditions.
This is an important point where they introduce case-specific background knowledge.
The system allows them to do so, up to the full expressiveness of mathematical formulas.
A similar multi-level complexity is offered for the "Explore" box in Figure 7: the system offers everything from standard reports, which suit the experts' needs in most cases, up to individually configurable diagrams. For model induction, our tool currently offers three branches that interact with and complement each other: decision trees, rule sets and rule cubes.


The key property that makes a tool more than the sum of its components, however, is the facility of interaction between its exploration and modelling components. Indeed, module interaction is the feature that allows users to flexibly apply the methods offered and to get the best out of each of them.
Such features, sometimes trivial but practically important, include:
• Extracting instance subsets as covered by a tree path or rule cube and exchanging
them within the modules for deeper analyses or visualization.
• Building a tree up to a certain level and then exchanging subsets of vehicles for
generating rule cubes to support a more in-depth causal investigation.
• Deriving new variables from tree paths or rule cubes.

5. Case Study

In this section we present a real-world case study to give an idea of our system and its
overall value. Imagine the following scenario: Several vehicles are brought to dealerships


because a lamp indicates an engine issue. Diagnostics information read from the engine
control module indicates some trouble with the exhaust system. Not knowing which part
exactly fails, dealers replace oxygen sensors on suspicion. Early warning systems for
warranty cost control show a significant increase in warranty claims for these sensors.
Quality engineers get alerted, but cannot find an explanation for the issue: The replaced
sensors are inspected and are okay. Yet, no other part seems to have failed.
The data analyst knows that only one engine type can set the fault code. Therefore he
restricts the data set to all instances with Opt_Engine=E. Now, the system shows a ranked
list of many possible influences (e.g. Opt_Emission, Mileage, BusinessCenter), confirm-
ing the engineer’s assumption that all service claims are related to the CARB states emis-
sions system (Figure 8).2 Note that although weighted lift ranks Opt_Emission highest,
other measures like information gain, gain ratio, or gini would not. The engineer's prior expectation that this variable is highly relevant makes it the preferable choice. The resulting tree is depicted in Figure 9.

Figure 8. Attribute Selection
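To illustrate how the choice of measure can change such a ranking, the following sketch (our own, with invented counts rather than the warranty data) computes per-value lift and information gain for two hypothetical candidate attributes; the exact weighted-lift variant used by the tool may differ from this simple version.

```python
import math

def entropy(pos, neg):
    """Binary entropy of a (faulty, healthy) count pair, in bits."""
    total = pos + neg
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * math.log2(p)
    return h

def info_gain(groups):
    """groups: list of (faulty, healthy) counts, one per attribute value."""
    pos = sum(g[0] for g in groups)
    neg = sum(g[1] for g in groups)
    total = pos + neg
    remainder = sum((p + n) / total * entropy(p, n) for p, n in groups)
    return entropy(pos, neg) - remainder

def lifts(groups):
    """Lift of each attribute value: its fault rate over the overall rate."""
    pos = sum(g[0] for g in groups)
    total = sum(p + n for p, n in groups)
    overall = pos / total
    return [round((p / (p + n)) / overall, 2) for p, n in groups]

# Invented counts for two candidate attributes (faulty, healthy per value).
# Depending on the counts, lift-based measures and information gain can rank
# attributes differently, which is why the expert's prior is used as well.
candidates = {
    "Opt_Emission": [(90, 910), (10, 8990)],   # failures concentrate in value N
    "Mileage":      [(40, 3960), (60, 5940)],  # failures spread evenly
}
for name, groups in candidates.items():
    print(name, "info gain = %.4f" % info_gain(groups), "lifts =", lifts(groups))
```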

After restricting the data set to vehicles with Opt_Emission=N, the analyst expects that State will no longer show up. However, State remains a high-ranked influence, and
the north-eastern CARB states, especially New York, have extraordinarily high failure
rates. In our tool this is illustrated by a map that shows lift and recall for each state on a
color range from green to red (Figure 10(a)). Based on this surprising result, he derives a new attribute CARBStates, which separates the north-eastern CARB states and California. The
analyst observes that most service events occur in CARB states although some vehicles
sold to CARB states run in other states. It is interesting that north-eastern CARB states still
show a much higher failure rate than California. A possible explanation for this could be
the stop-and-go traffic in New York.
Apart from the new attribute CARBStates, the attribute Min7Temp is ranked quite
high. The split preview in Figure 11 shows that the failure primarily occurs when the
minimum temperature within one week before the repair date was low.
Now the question arises whether this is due to the fact that the minimum temperature
is lower in New York than in California, whether failures primarily occurred during the
winter months or whether temperature is the true influence. The strength of rule cubes
2 The term CARB states (California Air Resources Board) refers to five US states that share very strict
emission laws: CA, NY, MA, ME, VT.


Figure 9. Decision Tree


[Figure 10: panel (a) is a U.S. map [15]; panel (b) is a nominal split chart showing the number of vehicles sold over the CARBStates values California, NorthEast and Others.]

Figure 10. Figure 10(a) shows the distribution of failures after restricting instances to Opt_Engine=E and Opt_Emission=N on a U.S. map (dark colors indicate a high lift and a high recall). The chart in Figure 10(b) illustrates that most vehicles with this engine type and the specific emission standard are driven in California, whereas most oxygen sensor problems occurred in north-eastern CARB states.

(a) Numeric Split Chart (b) Decision Tree Preview

Figure 11. Split preview for the variable Min7Temp. The temperature chart in Figure 11(a) and the decision
tree preview in Figure 11(b) indicate that failures are more likely to occur at low temperatures. In the chart, the
height of the bars visualizes the number of vehicles in total, while the bar color encodes the lift. The shaded
area highlights the split borders proposed by the split algorithm which can be adjusted manually.


Figure 12. Neighborhood of the influence CARBStates: In the three list boxes on the left, the tool suggests
potential causes, similar influences, and possibly caused influences, respectively. Min7Temp is listed among
the suggested causes. On selection of this variable the 2D-cube on the right provides details. It opposes the
CARBStates variable (horizontal axis) to the Min7Temp variable (vertical axis). Indeed, there are stronger
color (i.e., lift) differences over the Min7Temp axis than over the CARBStates axis in each row, and it is only
the statistical dependence (visible at the tile sizes) that makes the marginal at the bottom (the CARBStates
distribution) a seemingly strong influence.

Figure 13. Variable Min7Temp remains an influence after the elimination of variable CARBStates: There is a
strong color gradient from red (left, cold temperatures) to green (right, warm temperatures).

lies in their ability to help answer questions like these. As our tool allows exchanging
data and combining various data analysis methods, the engineer can apply rule cubes to
the data set. Now the engineer can examine the neighborhood of CARBStates as depicted
in Figure 12. The tool suggests that the high rank of CARBStates might have been caused
by weather conditions, e.g. variable Min7Temp. As temperatures in California are higher
than in the north-eastern states (also visible in Figure 12), the failure simply does not
occur that often there.
To verify this, the engineer eliminates the influence CARBStates. CARBStates obvi-
ously gets rank 0, but Min7Temp is still ranked quite high. If, however, the user eliminates the influence Min7Temp, the influence CARBStates almost vanishes (Figure 14). Note that, despite this interaction, the cube CARBStates × Min7Temp is ranked low (1.95),


Figure 14. Variable CARBStates is ranked quite low after the elimination of the influence Min7Temp; all three
tiles are in yellowish colors, representing lift values near 1.

since the main influence is the temperature, and there is no interaction between these two variables that causes the issue.
Based on these results the engineer actually finds out that there was a calibration
issue that sometimes caused the engine control module to set a diagnostic code when the
vehicle was driven at cold temperatures in wide open throttle mode.

6. Conclusion

We reported on our experiences of applying data mining methods in automotive quality


data analysis to support root cause investigations. The requirements of this task and specific data properties make it practically impossible to apply a fully automated process. Analysis tasks change structurally case by case, and thus a great amount of background knowledge is indispensable. This knowledge is difficult to formalize, but interactivity can close the gap: if a user is strongly involved in the process of building a model,
the resulting model will not only maximize a statistically motivated scoring function, but
the user’s personal importance score. Then, the posterior probability of the most likely
hypothesis also contains the user’s prior expectations.
Many approaches suggested in the literature turned out to be either too constrained or too complex to be offered without major adaptation. In such a setting, we consider it best to stick to simple methods, provide them in a way that is both flexible and understandable, and settle on interactivity. We presented interactive decision trees, rule sets, and finally
interactive rule cubes as three data mining methods that fit our requirements best. A
special focus is on the notion of causality.
Commercially available tools proved too inflexible with regard to domain adaptation. As our users are not data mining experts, the application workflow must follow the users'
workflows and not vice versa. There is a fixed process skeleton corresponding to the
original workflow and a regular user can just go through this process, while a power user
gains more flexibility (and complexity) upon request. A real-world case study showed
how our tool can be applied in practice.


References

[1] David J. Hand. Data mining—reaching beyond statistics. Research in Official Statistics, 2:5–17, 1998.
[2] Willi Klösgen. Applications and research problems of subgroup mining. In Proceedings of the Eleventh
International Symposium on Foundations of Intelligent Systems, 1999.
[3] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regres-
sion Trees. Chapman & Hall, 1984.
[4] G. V. Kass. An exploratory technique for investigating large quantities of categorical data. Applied
Statistics, 29:119–127, 1980.
[5] John Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[6] Markus Mueller, Christoph Schlieder, and Axel Blumenstock. Application of bayesian partition models
in warranty data analysis. In Proceedings of the Ninth SIAM International Conference on Data Mining
(SDM) (accepted), 2009.
[7] Willi Klösgen. EXPLORA: A multipattern and multistrategy discovery assistant. In Advances in Know-
ledge Discovery and Data Mining, pages 249–271. American Association for Artificial Intelligence,
Menlo Park, CA, USA, 1996.
[8] Stefan Wrobel. An algorithm for multi-relational discovery of subgroups. In Proceedings of the First
European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD-97), 1997.
[9] Nada Lavrač, Peter A. Flach, Branko Kavšek, and Ljupčo Todorovski. Rule induction for subgroup
discovery with CN2-SD. In ECML/PKDD’02 Workshop on Integration and Collaboration Aspects of
Data Mining, Decision Support and Meta-Learning, 2002.
[10] Bing Liu, Minqing Hu, and Wynne Hsu. Multi-level organization and summarization of the discovered
rules. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, 2000.
[11] Xifeng Yan, Hong Cheng, Jiawei Han, and Dong Xin. Summarizing itemset patterns: a profile-based
approach. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Dis-
covery and Data Mining, 2005.
[12] Kaidi Zhao, Bing Liu, Jeffrey Benkler, and Weimin Xiao. Opportunity map: Identifying causes of failure
– a deployed data mining system. In Proceedings of the Twelfth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, pages 892–901, 2006.
[13] Martin Scholz. Knowledge-based sampling for subgroup discovery. In Lecture Notes in Computer
Science, volume 3539, pages 171–189, 2005.
[14] Axel Blumenstock, Franz Schweiggert, Markus Müller, and Carsten Lanquillon. Rule cubes for causal
investigations. Knowledge and Information Systems, 2008.
[15] Wikimedia. Map of usa with state names. http://en.wikipedia.org/wiki/File:Map_of_
USA_with_state_names.svg, 2007. Last accessed: 2008-12-10.
Data Mining for Business Applications 35
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-35

Best Practices for Predictive Analytics in
B2B Financial Services

Raul DOMINGOS 1, Thierry VAN DE MERCKT

VADIS Consulting

Abstract. Predictive analytics is a well-known practice among corporations doing business with private consumers (B2C) as a means to achieve competitive advantage. The first part of this article intends to show that corporations operating in a business-to-business (B2B) setting have similar conditions to use predictive analytics in their favor. Predictive analytics can be applied to solve a myriad of business problems. The solutions to some of these problems are well known, while the resolution of other problems requires a considerable amount of research and innovation. However, predictive analytics professionals tend to solve similar problems in very different ways, even those for which there are known best practices. The second part of this article uses predictive analytics applications identified in a B2B context to describe a set of best practices to solve well-known problems (the "let's not re-invent the wheel" attitude) and innovative practices to solve challenging problems.

Introduction

The work that served as the motivation for this paper consisted of the development of systems to generate profitable sales leads for current customers and non-customers. These leads are the output of predictive models that are created following segmentations of the customer and non-customer databases. Due to the large number of predictive models often required to cover all the database segments, the application of "best practices" is necessary to render the exercise feasible.
Such a system, enabling the identification and usage of best practices, is the subject of the second part of the paper.

1. Predictive Analytics for B2B

Predictive analytics has already been used for several years by corporations as a source of competitive advantage. The pioneers in this domain have been B2C businesses with a large base of consumers and the capacity to collect and store detailed socio-demographic and transactional data from their customers. Most of these corporations
have applied predictive analytics to learn about their customers but only a few have
done it to learn about their non-customers due to the difficulty of capturing reliable
data about non-customers. The proliferation of data providers and the evolution of the

1 Corresponding Author: Raul Domingos, VADIS Consulting, Researchdreef 65 Allée de la Recherche, 1070 Anderlecht, Belgium; E-mail: [email protected]

technical infrastructure to manage huge amounts of data may make things easier, but concerns about data privacy can be a major barrier to exploiting external data sources
about consumers, especially in Europe where data privacy regulation is more
constraining than elsewhere [1].
This part of the article describes the business design of how predictive analytics can be used to generate the most profitable sales leads for both non-customers and customers of a financial services provider in a B2B setting. The business offer available in this context consists of products and services to attain a diversified set of goals, such as managing receivables, controlling financial risks, optimizing the treasury, securing and financing international business, or transforming the company's capital structure.
The factors that determine business lead profitability differ somewhat between non-customers and customers. The first section describes in detail the factors that
influence the potential profitability of a business lead for non-customers. The second
section describes in detail what changes for each factor from non-customers to
customers.

1.1. Profitable Sales Leads for Non-Customers

The profitability of a business lead with non-customers depends on at least the following three factors (one illustrative way of combining them into a single lead score is sketched after the list):

• Offer Acceptance: What is the chance that the non-customer is interested in


the business offer?
• Business Value: What is the expected value that the non-customer will
generate?
• Risk of Bankruptcy: What is the chance that the non-customer business will
be terminated?
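The article does not prescribe how these three factors are folded into one number; a simple and common option, shown here with purely hypothetical model outputs, is an expected-profit style score.

```python
def lead_score(p_accept, expected_value, p_bankruptcy):
    """One simple way to fold the three factors into a single expected-profit
    score. Hypothetical illustration, not the scoring used in the project."""
    return p_accept * expected_value * (1.0 - p_bankruptcy)

# Hypothetical non-customers: (id, P(offer acceptance), value estimate, P(bankruptcy)).
prospects = [
    ("A", 0.40, 12000.0, 0.02),
    ("B", 0.70,  3000.0, 0.01),
    ("C", 0.55,  9000.0, 0.30),   # attractive on acceptance and value, but risky
]

ranked = sorted(prospects, key=lambda p: lead_score(p[1], p[2], p[3]), reverse=True)
for pid, pa, val, pb in ranked:
    print(pid, round(lead_score(pa, val, pb), 1))
```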

1.1.1. Offer Acceptance



An offer can be accepted in different scenarios along two dimensions: "Has the offer" and "Needs the offer". This is illustrated in figure 1 "Opportunity Scenarios".


The hypothesis that the non-customer may already have the offer with some competitor has to be considered together with whether or not there is a need for even more of that offer. This is equivalent to saying that the competitor may have an up-sell opportunity. The commercial interest of the situations depicted in the matrix is not the same for all financial services.
A "Customer Migration" illustrates the scenario in which the non-customer already has the financial service with some competitor and does not have a need for more of that offer. The challenge for the sales representative is to convince the non-customer that it is interesting to migrate the financial service from the competitor to the sales representative's Bank. There is only a realistic commercial opportunity if there are no big business barriers to that migration. For instance, the migration of a short-term credit is easier than that of a long-term credit.
A "Competitive Opportunity" illustrates the scenario in which the non-customer already has the financial service with some competitor but still has a need for more of that offer. The sales representative will want to convince the non-customer to migrate the financial service to the sales representative's Bank, as in the previous case, but now together with an up-sell opportunity.
An “Unattended Opportunity” is probably the most interesting scenario for a sales
representative. The non-customer has a need for a new financial service.
If the non-customer has neither the service nor the need to acquire it, there is indeed "No Opportunity".
To estimate the location of each non-customer in the matrix of commercial scenarios, there are two modelling approaches that help to position the non-customer along each of the matrix dimensions:

• Propensity of Having (P2H): This classification model estimates whether or not a non-customer already possesses a certain financial service. The output of a P2H allows discriminating between the upper and the lower part of the matrix.
• Propensity to Buy (P2B): This prediction model estimates whether or not a non-customer has an untapped need for a certain financial service. The output
of a P2B allows discriminating between the left and the right part of the
matrix.

Both models together allow spotting opportunities across the whole matrix.
Estimating whether or not a company already has a financial service in a specific period is always possible using information such as the company's financial data. For instance, companies must explicitly declare the liabilities they have towards financial institutions in their balance sheets. However, predicting the need that a company may have for some financial service is not always viable. Since this prediction concerns a specific moment in the company's lifetime, the concept is not applicable to financial services that are recurrently acquired by a company without the opportunity for upgrade or up-sell (e.g. tax credit). The condition for applying this dimension is the ability to identify two consecutive periods in the company's lifetime, where the company does not have the service in the first period and has the service in the second period (e.g. long-term credit). Alternatively, the company may already have the service in the first period but acquire more of the same service in the second period. These concepts are illustrated in figure 2 "Time Windows".
If the data used in the predictive exercise is on a yearly basis (like the company
financial data), these periods have to be yearly as well.
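As a minimal sketch of how such yearly time windows translate into model targets, the snippet below derives P2H and P2B labels from two consecutive yearly observations; the data structure and field names are invented for illustration and do not reflect the Bank's actual data model.

```python
# Hypothetical yearly observations: company id -> outstanding amount of the
# financial service (e.g. long-term credit) in year t and year t+1.
year_t  = {"c1": 0.0, "c2": 150.0, "c3": 0.0,   "c4": 80.0}
year_t1 = {"c1": 0.0, "c2": 150.0, "c3": 200.0, "c4": 140.0}

def p2h_target(amount_t):
    """P2H: does the company already hold the service in the first period?"""
    return int(amount_t > 0)

def p2b_target(amount_t, amount_t1):
    """P2B: did the company newly acquire the service, or acquire more of it,
    in the second period? (the two scenarios of the "Time Windows" figure)"""
    return int(amount_t1 > amount_t)

for cid in year_t:
    print(cid, "P2H =", p2h_target(year_t[cid]),
               "P2B =", p2b_target(year_t[cid], year_t1[cid]))
```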

For each financial service, the following questions can be used to determine which modelling approaches (P2H, P2B) to calculate (a small sketch of this decision rule follows the list):

• Is the financial service acquired frequently and recurrently, offering no or only low up-selling opportunities?
Yes – Use only P2H. Stop.
No – Continue.

• Are there high business migration barriers to moving the financial service from one bank to another?


Yes – Use only P2B. Stop.
No – Both P2H and P2B can be interesting. Stop.
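Encoded as a decision rule, the two questions above could look like the following sketch (illustrative only):

```python
def models_to_build(recurrent_low_upsell, high_migration_barriers):
    """Encode the two questions as a simple decision rule (illustrative)."""
    if recurrent_low_upsell:
        return ["P2H"]
    if high_migration_barriers:
        return ["P2B"]
    return ["P2H", "P2B"]

# E.g. a recurrently acquired service such as a tax credit -> P2H only;
# a long-term credit with high migration barriers -> P2B only.
print(models_to_build(True, False))    # ['P2H']
print(models_to_build(False, True))    # ['P2B']
print(models_to_build(False, False))   # ['P2H', 'P2B']
```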

1.1.2. Business Value


While the selection of sales leads based on high chance of offer acceptance may be
sufficient to achieve some business goals such as market share, this is not enough to
maximize the bottom line. The selection of sales leads must be based both on a high chance of offer acceptance and on a high estimate of the value generated through that offer or through the totality of the new customer relation. The activation of each opportunity
scenario based on value is illustrated in figure 3 “Opportunity Scenarios with Value
Perspective”.

In a scenario of "Customer Migration", the value estimation of the overall new customer relation is more important for selecting the best non-customers to "steal" from the competition. This normally requires higher investments in the relationship, which must be justified by high expectations of future value generation.
In a scenario of “Unattended Opportunity”, the leads prioritization should be based
on the value estimation of the offer since the acquisition is based on an existing need
still not addressed by the competition. The value linked to this current need should be easier to fulfill than any other value expectation dependent on a positive development of a future business relationship.
In a scenario of “Competitive Opportunity” both value estimations should be taken
into consideration.
There are at least two approaches to define the business value to be estimated:

• The value that the new customer will generate is based on the sales
representative Bank best practices. This means that the commercial modus
operandi stays the same.
• The value that the new customer will generate is based on “perfect” practices.
This is equivalent to the total customer wallet estimation.


If the Bank does not intend to dramatically improve its commercial best practices, then the first approach is advised for non-customers. The major technical differences between implementing one approach or the other are further explained in the second part of this article.

1.1.3. Risk of Bankruptcy


Of all the factors that can determine the success or failure of the leads generated (e.g. depressed market conditions, competition, sales force inability, hazard – this accounts for all other factors not foreseen), there is one factor that can be controlled through one additional key performance predictor: the possibility that the non-customer's business may be terminated.
The commercial "hunting" activity is a continuous balance between two competing business goals: acquisition of new customers, and low risk of customers in the portfolio. Risk management in the acquisition of new customers is especially challenging due to the lack of knowledge about those potential new customers.
An estimate of the risk of bankruptcy for each non-customer allows avoiding companies that, despite their high chance of offer acceptance and high business value estimate, can quickly destroy any prospect of profitability by going into bankruptcy.

1.2. Profitable Sales Leads for Customers

The maximization of customer profitability certainly depends on how the business relations with both non-customers (new customers) and existing customers are managed. In this part of the document, the particularities of business relations with customers are further described.
Before delving into the details, a distinction must be made among different sets of customers. The sets of customers analyzed are illustrated in figure 4 "Customers Pyramid".


Customers have different levels of activity in their business relation with their financial services provider. The line between active and inactive customers can be
defined in different ways but in this context let us assume that inactive customers are

companies that acquired some financial service in the past but for the last 12 months
have not engaged in any kind of interaction with their Bank and the business value
generated was null. These customers can indeed be considered just like any non-
customer for commercial “hunting” purposes. Of course, the sales representative
should be aware of the existing history with that company when approaching it again.
Active customers that generate a negative value must be further analyzed to
understand the reasons that drive them to be unprofitable. Those customers that are
generating losses due to costs associated with default on credit payments or any other kind of delinquency behavior should be excluded from any marketing approach to maximize
customer profitability. Those customers that generate losses because of administrative
costs (e.g. intensive usage of costly channels such as branches for low value
operations) should be considered an opportunity for business value improvement just
like any other active customer that already generates a positive value.

1.2.1. Offer Acceptance


The overall approach of depicting the commercial opportunities across the dimensions
of “Has the service” and “Needs the service” stays the same. A customer can either
have a certain financial service with the sales representative's Bank or with some
competitor. The fundamental difference has to do with the way the P2H and P2B
models should be developed for customers.
From the sales representative's Bank perspective, the commercial opportunities with existing customers are either cross-sell or up-sell opportunities. Scoring customers using models developed for non-customers is the same as considering the commercial opportunities with customers as first-time acquisitions. The Bank would score its customers just as well as any of its competitors could! The information about the financial services that those companies own or have owned with the Bank, and the usage they make of them, is a major knowledge asset that should be exploited by the Bank when managing its portfolio of customers.

1.2.2. Business Value



Just as for non-customers, there are the same two possible approaches to define the
business value to be estimated. They are repeated next:

• The value that the new customer will generate is based on the sales
representative Bank best practices. This means that the commercial modus
operandi stays the same.
• The value that the new customer will generate is based on “perfect” practices.
This is equivalent to the total customer wallet estimation.

The creation of a value estimation based on the first approach is much less relevant for customers. The best indicator of the future value that a financial services customer will generate based on "business as usual" is the past value generated by that customer. The optimization of customer profitability is achieved by identifying those customers where there is untapped potential. The "Growth Value Matrix" in figure 5 illustrates a customer segmentation based on current value and growth potential.


The growth potential can be defined as the customer wallet share that is not
captured by the Bank ((total customer wallet – current value) / current value). The total
customer wallet estimation is the value model proposed for customers.
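With hypothetical numbers, the growth potential and a coarse Growth Value Matrix segmentation could be computed as below; the quadrant labels and thresholds are ours, since the article does not name them.

```python
def growth_potential(total_wallet, current_value):
    """Untapped potential as defined above:
    (total customer wallet - current value) / current value."""
    return (total_wallet - current_value) / current_value

def segment(current_value, potential,
            value_threshold=50_000, potential_threshold=1.0):
    """Quadrant of a Growth Value Matrix; names and thresholds are illustrative."""
    high_value = current_value >= value_threshold
    high_growth = potential >= potential_threshold
    if high_value and high_growth:
        return "develop further"
    if high_value:
        return "retain"
    if high_growth:
        return "grow"
    return "maintain at low cost"

# Hypothetical customers: (id, estimated total wallet, current yearly value).
customers = [("c1", 200_000, 60_000), ("c2", 90_000, 80_000), ("c3", 120_000, 20_000)]
for cid, wallet, value in customers:
    pot = growth_potential(wallet, value)
    print(cid, round(pot, 2), segment(value, pot))
```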

1.2.3. Risk of Bankruptcy


The risk of bankruptcy is equivalent for non-customers and customers. Again, the difference lies in the amount of additional data that a Bank can collect about its
customers. This data allows calculating more detailed models that should be compliant
with the industry risk standards such as Basel II [2].

1.2.4. Attrition Score


There is an additional factor to consider when assessing the profitability of leads for customers:

• Attrition Score: What is the expected lifetime of the business relationship with the Bank?

The estimated profitability of sales leads is only achieved if the customer stays around long enough doing business with the Bank. Despite the interesting profiles that some customers may have in terms of offer acceptance, business value estimation, and risk of bankruptcy, care must be taken not to invest too much in business relations that risk being terminated too soon by the customer.
Customers may terminate their relationship with the Bank either explicitly
(accounts and other products are closed) or silently (accounts and other products are no
longer used or are used to a minimum). Both types of attrition can be foreseen using
predictive models.
The knowledge about which customer relations have a high risk of attrition allows
us to act proactively instead of reacting when it may be too late to keep the customer
actively in the portfolio.


1.3. Summary

All the factors associated with the profitability of the sales leads can be fairly well estimated using the companies' financial information, which is public data. This is much more detailed information than any commercial company will ever be able to collect about personal consumers unless privacy laws are loosened. The publicly available data about companies is the equivalent of knowing the assets, liabilities, main sources of income, main expense types and family composition of personal consumers!
In the previous sections, the most important distinctive points on how to manage
non-customers and customers to maximize profitability were identified. The summary
of these distinctive characteristics is presented in table 1 "Customers vs. Non-Customers".

Table 1. Customers vs. Non-Customers

Offer Acceptance
  Non-customers: models for first-time acquisition based on external data.
  Customers: models for cross-sell and up-sell opportunities based on external and internal data.
Business Value
  Non-customers: value estimations based on the Bank's best practices and external data.
  Customers: total customer wallet estimation using external and internal data to allow the identification of customers with a higher growth potential.
Risk of Bankruptcy
  Non-customers: estimation of the risk of bankruptcy based on external data.
  Customers: risk models of credit default or bankruptcy compliant with Basel II.
Attrition Score
  Non-customers: not applicable.
  Customers: detection of customer attrition, both explicit and silent.

The translation of the business design just described into a solution based on predictive analytics requires the implementation of many models. The implementation and management of all these models require the adoption of effective methodologies supported by best practices. The identification and usage of these best practices is the subject of the next part of this paper.

2. Innovative Practices vs. Best Practices

By reading a sample of reports about predictive analytics projects, one realizes that the practice of predictive analytics is probably one of the domains with the most variation in how professionals solve similar problems. This is positive for problems for which best practices are unknown, but it can be a source of inefficiency for well-known problems.
In the context of this paper, the concept of "best practices" means the set of knowledge that is applicable to most instances of the same problem most of the time, and whose application can enhance the chances of successfully solving those problems.
We can speculate about some of the reasons behind this behavior:


• The amount of research in the domain is quite high, which translates into the diversity of analytical methodologies available. These analytical
methodologies define how the data is prepared and which families of learning
algorithms are employed.
• The lack of a body of knowledge formally described in a single book that
should contain the set of principles that would be generally accepted as best
practices by all the practitioners (the laws of predictive analytics).
• The hybrid nature of predictive analytics, blending different fields of science (e.g. machine learning, statistics, mathematics) with business.

Another factor that certainly contributes to this state of affairs is the lack of predictive analytics systems on the market with functionalities that strike the right balance between exploitation of best practices and exploration of new approaches. These systems could help the novice practitioner avoid fatal errors and help the experienced practitioner focus on open issues, while the system would give precise guidance for recurrent issues without any extra distraction. Most of the
predictive analytics systems that exist in the market are in the form of a workbench [3].
These systems typically provide:

• many families of learning algorithms with available parameters at different


levels of depth;
• different methods of data manipulation to implement a model creation-
validation setup;
• several ways of evaluating the different predictive models created;

This is good for practitioners that need to explore different approaches to solve sophisticated problems, but such workbenches are too open-ended to efficiently tackle classic
predictive analysis. The following section will describe the different predictive models
required to deliver the vision explained in the first part of the article. We will see that
there are different levels of complexity in this set of predictive models.

2.1. Predictive Models of Different Complexity



Let us take as an example the different predictive models required to deliver the
predictive analyses for B2B described in the first part of this article. Considering 10
distinct financial services provided by a Bank, we would need to develop about:

• 10 P2H models and 10 P2B models for non-customers and customers;


• 1 global customer value model for non-customers and customers;
• 10 value models for non-customers (one per financial service) and customers;
• 1 risk of bankruptcy model for non-customers and customers;
• 1 attrition model for customers;

This gives a total of 65 models, and we are assuming here that there would not be any pre-modelling segmentation of either non-customers or customers. A possible segmentation could be to split companies into small, medium and large ones. This would increase even further the number of models required to deliver the business design
described in the first part of this article. The number of models to give a good picture


of the bankruptcy and attrition risk might be more than 2 but for the sake of simplicity
we consider here only 2 models.
Predictive models can be classified according to many criteria. For the purposes of
best practices identification, we will classify the 65 models according to the nature of
the target variable: the target variable can either be a binary variable or a continuous
numerical variable. The 65 models can be divided into three levels of complexity:

• There are 43 models with a binary target variable. These are the 20 P2H, the
20 P2B and the models to estimate the risk of bankruptcy and attrition.
• There are 11 models with a continuous numerical target variable for non-
customers. These are the 10 models to estimate the value that will be
generated through a specific financial service plus 1 model to estimate the
overall value generated by the non-customer.
• There are another 11 models with a continuous numerical target variable but
for customers.

The set of models that are the easiest to create are the 43 binary models. There are several bibliographical references (e.g. [4]) about the creation of binary models. We can find many recipes for the problems related to the creation of a binary model, ranging from feature selection to the avoidance of overfitting.
The continuous models try to estimate the value that would be generated if the Bank could capture the total wallet of the customer. This wallet can be shared by different banks and there can even be a part of the wallet that is simply not reached by any of the banks (one can argue that this last case only happens at companies with lazy chief financial officers!).
The fact that the numerical value to be estimated is unknown (the total wallet size) makes the development of the continuous models particularly difficult. For the binary
models, the target concerns the ownership (or not) of a product with the Bank. This is
an event that can be precisely identified with the internal data.

2.2. Innovative Practices for Share of Wallet Estimation



The aim of this part of the article is not to delve into all the details of sophisticated problems such as the estimation of share of wallet but to expand on the best practices to solve classical problems such as the creation of a binary model. However, the problem of share of wallet estimation is explained further here to better illustrate the situation in which innovation should set the tone instead of the application of best practices.
Basically, the total wallet size of a company for financial services is the total amount of money that one could expect a company to either invest in financial assets or borrow from Banks in a certain period (e.g. a year). Since this amount is not readily observable, a possible alternative is to ask companies directly for this amount, for example through a survey. There is more than one obstacle to completing this task successfully (not even to mention the ability to convince companies to answer this kind of inquiry):

• reach the right people at each company that will be actually able to identify
the right answer;
• make sure that the answer is not just related to a part of the company but really
to the entire company;


• make sure that every company understands the survey questions in the same way, so that we can be sure we are comparing apples with apples in the survey's replies;

A more analytical (and feasible) approach is to consider that the total wallet size of a certain company should be no less than the actual amounts that "similar" companies are already spending with the Bank. The trick here is to get the decision of who is similar to whom right. It just so happens that there is an analytical tool that gives exactly this information: quantile regression. Just as a classical regression analysis estimates the average of the distribution of the dependent variable for observations close to one another in the observational space, quantile regression does exactly the same but, instead of the average, yields the regression of some precise quantile of our choice. For instance, if we decide to estimate the 90th percentile, this is equivalent to considering the total wallet size of a company to be the observation point found at the 90th percentile of the distribution of the actual values for "similar" companies as identified by the quantile regression model itself.
One detailed example of the application of quantile regression to estimate the share
of wallet can be found in [5]. The technique of quantile regression is fairly unknown to
the mainstream analyst but is by no means a recent invention. As Tukey put it back in 1977 [6]:

“What the regression curve does is give a grand summary for the averages of the
distributions corresponding to the set of x’s. We could go further and compute several
different regression curves corresponding to the various percentage points of the
distributions and thus get a more complete picture of the set. Ordinarily this is not
done, and so regression often gives a rather incomplete picture. Just as the mean gives
an incomplete picture of a single distribution, so the regression curve gives a
corresponding incomplete picture for a set of distributions.”
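A minimal sketch of this idea, using the quantile regression implementation in statsmodels on synthetic data (the feature and the data-generating process are invented), contrasts the conditional mean given by ordinary least squares with the 90th-percentile estimate read as a wallet proxy. It illustrates the mechanics only and is not the model of [5].

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic "similar companies": observed spending with the Bank is a noisy,
# partially captured fraction of a latent wallet driven by company size.
n = 500
turnover = rng.lognormal(mean=10.0, sigma=0.8, size=n)            # invented driver
latent_wallet = 0.05 * turnover
observed_spend = latent_wallet * rng.uniform(0.1, 1.0, size=n)    # captured share varies

X = sm.add_constant(np.column_stack([turnover]))

# Classical OLS estimates the conditional *mean* of observed spending ...
ols = sm.OLS(observed_spend, X).fit()
# ... while the 90th-percentile quantile regression approximates what the
# best-served "similar" companies spend, which we read as a wallet proxy.
q90 = sm.QuantReg(observed_spend, X).fit(q=0.9)

new_company = sm.add_constant(np.array([[50_000.0]]), has_constant="add")
print("mean estimate:  ", float(ols.predict(new_company)[0]))
print("wallet estimate:", float(q90.predict(new_company)[0]))
```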

2.3. Best Practices for a Binary Model Creation

The problem at hand consists of developing binary models about the ownership (or not) of a financial product. This is a common task among projects related to data mining. This task should be executed by applying best practices instead of trying to re-invent the creation process of a binary model.
Virtually any analytical assignment can be decomposed into four main phases:

a) Define: Defining the problem that needs to be addressed.


b) Design: Designing the problem solution through an analytical approach.
c) Implement: Implementing the analytical approach previously designed.
d) Deploy: Deploying the solution implemented to effectively solve the problem.

The problem understanding and analytical solution design are the critical and distinguishing phases of an analytical assignment. As the motto goes, the right answer to the wrong question is no better than a plain wrong answer to the right question. What is meant by predictive analytics best practices has to do with the process of implementation. At this point of the assignment, the analyst typically has a flat data file as input, and a binary model is the expected output. Most of the analysts reading this text will identify with the following set of questions:


• How can I select the right variables to create my model when my flat data file
contains thousands of input variables?
• Should I treat in any special way the outliers? How?
• What about missing values? Are they going to affect my model if I do nothing
about them?
• Which learning algorithm should I use? Should I use different techniques
through an ensemble model approach?
• Depending on the learning algorithm that I am going to use, which statistical
data assumptions should I validate before I employ it?
• How should I use the available data in the flat file to avoid the overfitting
trap?
• How should I use the available data in the flat data file to make sure that I will
be able to correctly evaluate the generalization capacity of my final model?
• Do I have enough positive cases to really learn some patterns or do I need to
do something to take care of this?
• Assuming that my learning algorithm of choice has parameters to fine tune,
which settings should I use?
• Exactly which evaluation measures will give me the most relevant insight
about the quality of my model?
• Is it ok to use my model output as it comes or do I need something else like a
probability?

Why is it necessary to address all these questions over and over again in each analytical assignment and devise a solution from scratch? The time spent in this way would have a much better pay-off if applied in the define and design phases instead. Best practices should point to a set of analytical approaches that enable a certain task to be achieved flawlessly within a limited amount of time. What we advocate is exactly a predictive analytics system that receives as input the flat data file, with the identification of which variable is the target and which variables are possible inputs, and delivers as output a binary model.
This predictive analytics system will take care of all the questions listed above using embedded best practices that, applied together, can make a huge difference in the quality of the final model produced.
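The system referred to here is the proprietary one described in [8]. As a rough, hypothetical illustration of what "embedded best practices" can look like when wired together, the sketch below builds a scikit-learn pipeline that answers several of the questions above in one place (median imputation for missing values, one-hot encoding, class weighting for the rare positive class, probability calibration, and a stratified held-out evaluation); it is not the actual system.

```python
import numpy as np
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(1)
n = 5000

# Synthetic "flat file": a few financial ratios with missing values, one
# categorical column and a fairly rare binary target (product ownership).
df = pd.DataFrame(rng.normal(size=(n, 5)),
                  columns=[f"ratio_{i}" for i in range(5)])
df = df.mask(rng.random(df.shape) < 0.10)          # inject missing values
df["sector"] = rng.choice(["retail", "manufacturing", "services"], size=n)
y = ((df["ratio_0"].fillna(0) + rng.normal(scale=2.0, size=n)) > 2.5).astype(int)

numeric_cols = [c for c in df.columns if c.startswith("ratio_")]
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["sector"]),
])

# class_weight guards against the rare positive class; the calibration wrapper
# turns raw scores into probabilities that can be used to rank leads.
model = Pipeline([
    ("prep", preprocess),
    ("clf", CalibratedClassifierCV(
        RandomForestClassifier(n_estimators=200, class_weight="balanced",
                               random_state=0), cv=3)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.3, stratify=y, random_state=0)
model.fit(X_train, y_train)
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("held-out AUC: %.3f" % auc)
```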
While this system cannot assure the best possible model in every scenario, a quite good model can be achieved in a fairly short time. The system was used in the PAKDD 2007 competition, achieving 6th place [7]. The total effort invested by the analyst in this competition was 1 man-day. This goes from reading the contest rules and understanding the problem and the data to achieving the final model reported.
The curious reader may find more information on the best practices in the paper [8]. Equipped with such a system, the task of creating 43 binary models within a project's lifetime, with flat data files on the order of thousands of variables and hundreds of thousands of rows, does not look so daunting.

3. Conclusion

This text has described the potential of using predictive analytics in a B2B setting.
There is much more data available about companies than predictive analytics


practitioners may think. Additionally, the success of applying predictive analytics in a business environment depends on the correct preliminary business analysis, which consists of defining the related business process integrated with the right predictive models at the right places. The number of models required to cover all the business needs can grow rather quickly depending on those needs (e.g. different customer segments to cover depending on their size).
The second part of this text describes the importance of knowing when to explore innovative practices to tackle sophisticated analytical problems and when to apply best practices to solve classical problems. This can mean the difference between an intellectually startling yet ruinous assignment and a fulfilling and profitable analytical assignment, especially if the number of models to create is fairly high.
The ideas of this paper were presented in the context of supervised learning tasks applied to marketing challenges in the financial services industry. While the concepts presented in the first part of this paper were developed specifically for the financial services industry, the trade-off between using best practices and innovative approaches is also a matter for any data mining task (e.g. segmentation) applied to any business challenge (e.g. risk management) in other industries (e.g. telecom, retail).

References

[1] The European Commission Home Page for Data Protection as of May 2008
http://ec.europa.eu/justice_home/fsj/privacy/index_en.htm

[2] Web site of Bank for International Settlements as of May 2008 http://www.bis.org/

[3] Poll about data mining software used in real projects as of May 2008
http://www.kdnuggets.com/polls/2008/data-mining-software-tools-used.htm

[4] P.N. Tan, M. Steinbach, V. Kumar. Introduction to Data Mining. Addison-Wesley Longman Publishing Co., Inc., 2005.

[5] S. Rosset, C. Perlich, B. Zadrozny, S. Merugu, S. Weiss, and R. Lawrence. Wallet estimation models. In
International Workshop on Customer Relationship Management: Data Mining Meets Marketing, 2005.

[6] F. Mosteller and J. Tukey. Data Analysis and Regression: A Second Course in Statistics. Addison-Wesley, Reading, Mass., 1977.

[7] Web site of PAKDD 2007 Data Mining Competition as of May 2008
http://lamda.nju.edu.cn/conf/pakdd07/dmc07/index.htm

[8] T.V. de Merckt, J.F. Chevalier. PaKDD-2007: A Near-Linear Model for the Cross-Selling Problem. In
International Journal of Data Warehousing and Mining, Vol. 4, Issue 2, 2008

Data Mining for Business Applications 49
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-49

Towards the Generic Framework for
Utility Considerations in Data Mining
Research
Seppo PUURONEN a and Mykola PECHENIZKIY b,1
a University of Jyväskylä, Finland
b Eindhoven University of Technology, the Netherlands
1 Corresponding Author: Department of Computer Science, Eindhoven University of Technology,
P.O. Box 513, 5600 MB Eindhoven, the Netherlands; E-mail: [email protected].

Abstract. Rigorous data mining (DM) research has successfully developed advanced
data mining techniques and algorithms, and many organizations have great expectations
of taking more benefit from their vast data warehouses in decision making. Even
though there are some success stories, the current status in practice mainly consists of
great expectations that have not yet been fulfilled. DM researchers have recently
become interested in utility-based DM (UBDM), starting to consider some of the
economic utility factors (such as cost of data, cost of measurement, cost of class labels
and so forth), but many other utility factors are still left outside the main directions
of UBDM. The goal of this position paper is (1) to motivate researchers to consider
utility from a broader perspective than is usually done in the UBDM context and (2) to
introduce a new generic framework for these broader utility considerations in DM
research. Besides describing our multi-criteria utility based framework (MCUF), we
present a few hypothetical examples showing how the framework might be used to
consider the utilities of some potential DM research stakeholders.
Keywords. utility-based data mining, data mining stakeholders, rigor vs. relevance
in research

Introduction

Nowadays, the rapid growth of IT has brought tremendous opportunities for data collec-
tion, data sharing, data integration, and intelligent data analysis across multiple (poten-
tially distributed and heterogeneous) data sources. Since the 90s business intelligence has
started to play an increasing role in many organizations. Data warehousing and data min-
ing (DM) are becoming more and more popular tools to facilitate knowledge discovery
and contribute to decision making.
Yet, DM is still a technology loaded with great expectations to enable organizations to
take more benefit of their huge databases. There exist some success stories where
organizations have managed to gain competitive advantage from DM. Still, the strong
focus of most DM researchers on technology-oriented topics does not support expanding
the scope to less rigorous but practically very relevant research topics. The current
situation with DM has similarities with situations during the earlier development of some
other information technology (IT)-related sub-areas. Research in the Information Systems
(IS) discipline (one of those IT-related sub-areas) has strong traditions of taking into
account human and organizational aspects of systems besides the technical ones.
We have suggested a provocative discussion on why DM does not contribute to business
[30] and emphasized further in [32] that user- and organization-related research results
and organizational settings used in the IS discipline include essential points of view which
it might be reasonable to take into account in developing DM research towards practically
more relevant directions in domain areas where human and organizational matters count.
Like IS, DM research has several stakeholders, the majority of which can be divided into
internal and external ones, each having their own and commonly conflicting goals.
Currently, DM researchers rarely take industry (the most important external stakeholder)
into account while conducting their often rigorous research activities. This holds even in
the industry context, where the meaning, design, use, and structure of a DM artifact is an
important topic. The situation is still more complicated because outputs vary significantly
by industry, affecting the meaning and measurement of utility and performance.
Although recent developments in cost-sensitive learning and active learning have started
to consider some of the economic utility factors (like cost of data, cost of measurement,
cost of class labels and so forth), many other utility factors are still left outside the main
directions of the emerging utility-based DM research (UBDM).
For us, DM is inseparably included as an essential part of the knowledge discovery
process, and we see that a more holistic view of DM research is needed. If we, as DM
researchers, want to participate in this kind of research effort, then we also need to take
utility-related topics under investigation. Simple assessment measures like predictive
accuracy have to give way to economic utility measures, such as profitability and return on
investment. On the other hand, DM systems have their own peculiarities as IS systems,
which should also be taken into account in a holistic view of DM systems research.
Thus, the goal of this paper is (1) to motivate DM researchers to consider the
possibility of taking utility aspects into account from a broader perspective than is usually
done in the UBDM context nowadays and (2) to introduce a generic framework for utility
considerations in DM research, with a few examples from the point of view of some
hypothetical DM research stakeholders.

The rest of the paper is organized as follows. In Section 1 we introduce DM as a
process and very briefly review and summarize the recent research directions in UBDM,
which were reflected in the UBDM-05 and UBDM-06 workshops [39][42][26] and some
other relevant publications. In Section 2 we motivate the broadening of the utility concept
in the context of UBDM, emphasizing the importance of business understanding and the
necessity to analyze the process of using/applying the developed DM artifact in the real
context. In Section 3 we continue the analysis of DM research stakeholders, resulting in
our new generic multi-criteria utility based framework (MCUF). In Section 4 we present a
revisited view of the DM process from the UBDM perspective and show some examples of
how our MCUF might be used to describe the situation from the points of view of a few
stakeholders in some contexts. Section 5 concludes the paper and discusses future work.

1. DM process and UBDM

Fayyad in [16] defines knowledge discovery from databases (KDD) as "the nontrivial
process of identifying valid, novel, potentially useful, and ultimately understandable
patterns in data". Before focusing on the discussion of the DM process, we would like to
note that this definition by Fayyad is very capacious: it gives an idea of what the goal of
DM is, and in fact it is cited in the introductory sections of many DM-related papers.
However, in many cases those papers have nothing to do with the novelty, interestingness,
potential usefulness and validity of the patterns which were discovered or could be
discovered using the DM techniques proposed later in those papers.
The DM process comprises many steps, which involve data selection, data
preprocessing, data transformation, search for patterns, and interpretation and evaluation
of patterns [16]. These steps start with the raw data and finish with the extracted
knowledge acquired as a result of the whole DM/KDD process. The set of DM tasks used
to extract and verify patterns in data is the core of the process. Most current DM/KDD2
research is dedicated to pattern mining algorithms and to descriptive and predictive
modeling of data. Nevertheless, this core process of searching for potentially useful
patterns typically takes only a small part (estimated at 15% - 25%) of the effort of the
overall KDD process. The additional steps of the KDD process, such as data preparation,
data selection, data cleaning, incorporating appropriate prior knowledge, and proper
interpretation of the results of mining, are also essential to derive useful knowledge
from data.

2 We would like to clarify that, according to Fayyad's definition and some other research literature, DM is
commonly referred to as a particular phase of the entire process of turning raw data into valuable knowledge,
and it includes the application of modeling and discovery algorithms. In industry, however, both the knowledge
discovery and DM terms are often used as synonyms for the entire process of producing valuable knowledge
from data.

The life cycle of a DM project according to the CRISP-DM model (Figure 1) consists
of six phases (though the sequence of the phases is not strict, and moving back and
forth between different phases normally happens) [8]. The arrows indicate the most
important and frequent dependencies between phases. The outer circle in the figure
denotes the cyclic nature of DM – a DM process continues after a solution has been
deployed. If some lessons are learnt during the process, new and likely more focused
business questions can be recognized and subsequently new DM processes will be
launched.
CRISP-DM has much overlap with Fayyad's view. However, we would like to
emphasize that the DM process is now explicitly put into a business context, which is
represented by the business understanding and deployment blocks.
For us it is natural to define utility in UBDM as a measure of overall (e.g. economic)
benefit, and thus the concept of utility should be connected with the entire DM process.
In the rest of this section we review UBDM research directions and summarize the steps
of the CRISP-DM process which they seem to take into account.

Figure 1. CRISP-DM: CRoss Industry Standard Process for Data Mining [8]

Considerations of costs and benefits are common to all managerial decisions in
organizations. Consequently, the quality of a DM artifact and its output must be evaluated
considering its ability to enhance the quality of the resulting decision. Most early work
in predictive DM did not address the different practical issues related to data preparation,
model induction and its evaluation and application. Cost-sensitive learning research [38]
emerged initially in DM as an effort to reflect the relevance of incorporating the costs
resulting from a decision (based on the prediction of a DM model). Many application
areas of DM suggested that, e.g. for classification, the costs of predicting the class
membership of instances accurately are proportional to the number of accurately predicted
instances, yet need to account for the asymmetric costs associated with true versus false
predictions of positives and negatives. Knowledge of this asymmetry can be used to guide
the parameterization of a classifier and the selection of the most appropriate one
(e.g. MetaCost [14] or cost-sensitive boosting [15]). This led to the development of robust
evaluation techniques like the ROC convex hull method [31] or the area under the ROC
curve (AUC) [6], which can be utilized when considering the business problem and
managerial objectives.
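
To make the role of asymmetric misclassification costs more concrete, the following
sketch (our own illustration, not taken from the cited works; all scores, labels and cost
values are invented) shows how a classifier's decision threshold can be chosen to
minimize average cost rather than to maximize accuracy.

```python
# Illustrative sketch only (not from the cited papers): choose a decision
# threshold that minimizes average misclassification cost instead of
# maximizing accuracy. All scores, labels and costs below are invented.

def average_cost(threshold, scores, labels, cost_fp, cost_fn):
    """Average cost when cases with score >= threshold are predicted positive."""
    total = 0.0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 0:
            total += cost_fp            # false positive
        elif not predicted_positive and label == 1:
            total += cost_fn            # false negative
    return total / len(scores)

def best_threshold(scores, labels, cost_fp, cost_fn):
    """Scan candidate thresholds and keep the cheapest one."""
    candidates = sorted(set(scores)) + [max(scores) + 1.0]  # last = predict all negative
    return min(candidates,
               key=lambda t: average_cost(t, scores, labels, cost_fp, cost_fn))

# Toy validation data: model scores and true classes (1 = positive).
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.20, 0.90, 0.55]
labels = [0,    0,    1,    1,    0,    0,    1,    1   ]

# A false negative is assumed five times costlier than a false positive.
t = best_threshold(scores, labels, cost_fp=1.0, cost_fn=5.0)
print("chosen threshold:", t)
print("average cost at that threshold:", average_cost(t, scores, labels, 1.0, 5.0))
```

The same principle, applied at training rather than prediction time, underlies
cost-sensitive wrappers such as MetaCost, which relabel training examples so that a
standard learner ends up minimizing expected cost.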
As DM commonly refers to secondary data analysis, it is often assumed that a
fixed amount of training data (collected for some other purposes) is available for
the current goal of knowledge discovery. Consequently, it is assumed by many developers
of DM techniques that data is given and there are no costs associated with the availability
of the data. However, sooner or later it becomes evident that the availability of data for
analysis (and especially the availability of labeled data for supervised learning) is not free;
the economic utility of acquiring training data (or labeling unlabeled data) should therefore
be considered among the costs of building a model and applying the model.
Thus, e.g. in medical diagnostics the general problem can be formulated as: given
the costs of tests and a total fixed budget, decide which tests to run on which
patients to obtain the additional information needed to produce an effective classifier
(assuming that little or no training data is available initially) [18]. The cost consideration
then includes the costs associated with building the classifier and the costs (and benefits)
associated with applying the classifier.
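
As a toy illustration of this budgeted setting (our own simplification, not the approach
of [18]), the sketch below greedily buys the tests with the highest assumed information
value per unit cost until a fixed budget is exhausted; all test names, costs and values are
hypothetical.

```python
# Toy sketch of acquisition under a budget (a simplification for illustration,
# not the algorithm of the cited work). Greedily pick the tests with the best
# assumed value-per-cost ratio until the budget runs out.

def select_tests(tests, budget):
    """tests: dict name -> (cost, assumed_value). Returns (chosen names, spent)."""
    chosen, spent = [], 0.0
    ranked = sorted(tests.items(),
                    key=lambda kv: kv[1][1] / kv[1][0],  # value density
                    reverse=True)
    for name, (cost, value) in ranked:
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent

# Hypothetical diagnostic tests with made-up costs and information values.
tests = {
    "blood_panel": (20.0, 0.30),
    "x_ray":       (50.0, 0.45),
    "mri":         (400.0, 0.80),
    "biopsy":      (150.0, 0.70),
}
print(select_tests(tests, budget=250.0))   # -> (['blood_panel', 'x_ray', 'biopsy'], 220.0)
```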
The evaluation of cost-sensitive learners was studied in [22], which introduced cost
curves that enable easy visualization of average performance (expected cost), operating
range, confidence intervals on performance, and differences in performance and their
significance.
Thus, it seems that most of the current research in UBDM leans towards the
cost-sensitive learning and active learning paradigms from a machine learning perspective,
and treats total utility as derived from the following DM-related processes:

data preparation3 – costs of acquiring the data, including primarily the costs of (1)
measuring an attribute value, (2) data labeling for supervised learning, (3) data record
collection/purchase/retrieval, and (4) data cleaning and preprocessing;
data modeling and evaluation – costs of searching for patterns in the data, costs of
misclassification, and benefits of using the discovered patterns/models4.
Thus, the deployment and impact estimation steps of the KDD process, i.e. its use-oriented
steps, are currently almost completely ignored in (UBDM) DM research.
Even though UBDM researchers say that the goal of UBDM is to act so as to maximize
the total benefit of using the mined knowledge minus the costs of acquiring and mining
the data, this does not yet imply a thorough analysis of the use-oriented steps of the KDD
process and an accounting for the various benefits (and risks) associated with them. The
mined knowledge is of utility to a person or an organization if it contributes to reaching a
desired goal. Utility-based measures in itemset mining use the utilities of the patterns to
reflect the user's goals. Yao and Hamilton [41] review utility-based measures for itemset
mining and present a unified framework for incorporating several utility-based measures
into the DM process by defining a unified utility function. The objective and subjective
interestingness of the results of association analysis have been studied by several authors
from different perspectives [25][7][27][33]. Yet, it is always assumed that there is a
(single type of) user and that the user is able to clearly formulate business challenges and
to help find an appropriate transformation of them into a set of DM tasks, or simply to
pick one of the suggested solutions.
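
As a minimal, concrete reading of such utility-based itemset measures (our own
simplified example, not the unified utility function of [41]), the utility of an itemset can
be taken as the profit it generates across the transactions that contain it:

```python
# Simplified sketch of a utility-based itemset measure (illustrative only,
# not the unified function of the cited framework): the utility of an itemset
# is the profit it generates in all transactions that contain the whole itemset.

def itemset_utility(itemset, transactions, unit_profit):
    total = 0.0
    for basket in transactions:                       # basket: dict item -> quantity
        if all(item in basket for item in itemset):   # transaction supports the itemset
            total += sum(basket[item] * unit_profit[item] for item in itemset)
    return total

# Hypothetical transactions and per-item profits.
transactions = [
    {"milk": 2, "bread": 1},
    {"milk": 1, "bread": 2, "butter": 1},
    {"bread": 3},
]
unit_profit = {"milk": 0.50, "bread": 0.20, "butter": 1.00}

print(itemset_utility({"milk", "bread"}, transactions, unit_profit))   # 2.1
```

High-utility itemsets found this way can differ substantially from the most frequent
ones, which is exactly the point of reflecting the user's goals in the measure.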
In the following section we consider different types/levels of DM use and DM stake-
holders emphasizing the differences in utility considerations depending on the type of
DM use or the type of DM stakeholder.

2. Broadening the concept of utility in the context of UBDM

DM is fundamentally an application-oriented area motivated by business and scientific
needs to make sense of mountains of data [40]. Thus it is essential to conduct research
related to the use of DM systems, considering impacts and the essential factors that
have an effect on those impacts. In the CRISP-DM model considered in the previous
section the starting point is set as business understanding [8]. A closer look at that part
of the model is presented in Figure 2 and discussed below. The initial phase of the model
focuses first on understanding the objectives and requirements of the DM project from a
business perspective. After that the project is considered in detail as a DM problem and
a preliminary plan is designed to achieve the objectives.
In the CRISP-DM model the first phase, aiming at thoroughly understanding the
user's true business needs, is a very challenging one. Often users have many competing
and even conflicting objectives, which need to be uncovered and balanced to the greatest
possible extent at the very beginning of the modeling. Recognizing this need is a good
starting point, but as is well known in the IS field, the needs of users are commonly hard
to discover. In the CRISP-DM model the next phases are to describe the customer's
primary objective and the criteria for a successful or useful outcome of the project from a
business perspective. These criteria commonly include things to be evaluated subjectively,
but may include some aspects which can even be measured objectively.

3 The data preparation step can be associated not only with data preprocessing, data selection and data
transformation processes, but also with data collection, data acquisition and/or data labeling.
4 Currently this direction is limited to accounting for some economic benefits, known (or believed to be
known) in advance, without estimation of the actual individual or organizational impact of using the DM artifact.

Figure 2. Business understanding step in CRISP-DM [8] (p.16)

In the CRISP-DM model the next phase includes a deep evaluation of the starting
situation, covering many aspects. In addition to the possible assumptions that have been
made about the data to be used during DM, the possible key assumptions about
the business are also important, especially if they are related to conditions on the validity
of the results and to legal aspects, besides the ordinary ones (such as available resources,
constraints, schedule, and security). A very important part of the evaluation is to consider
the possible risks and the actions to be taken if the risks materialize.


It is essential to have a strict terminology in order to keep discussions consistent and
understandable to all involved. This aspect is also raised in CRISP-DM right at the
very beginning. As usual in any IS project, the different parties, users and developers,
have their own terminologies, and they should manage to find a commonly understandable
language for communication.
Interestingly, in CRISP-DM the cost-benefit analysis of the project is included in this
phase, even before considering the DM goals, which might also, in our understanding, be
taken under cost and benefit considerations.
DM goals are determined during the third phase of the CRISP-DM model (Figure 2)
using technical terms defining the outputs that make it possible to achieve the business
goals. It is interesting that, even though CRISP-DM gives value to business goals and
deployment, the examples of DM success criteria mentioned are predictive accuracy or a
propensity-to-purchase profile with a given degree of "lift" [8] (p.17). It is mentioned that
these success criteria can also be subjective, in which case there is a need to identify the
person(s) who make the judgments.

The final phase of the business understanding step in the CRISP-DM model
(Figure 2) is producing a project plan. Besides the typical elements of a project plan
(such as duration, required resources, inputs, outputs and dependencies of each step), this
plan also needs to include DM-specific elements (such as large-scale iterations of the
modeling and evaluation phases with an evaluation strategy, and an analysis of
dependencies between the time schedule and the risks, with actions and recommendations
if the risks appear). The project plan also includes an initial selection of tools and
techniques to be applied.
The above CRISP-DM model, besides giving normative advice for the DM process,
also mentions many important business-related utility aspects. We would like to stress
here again that, because the majority of DM efforts omit utility considerations, most
state-of-the-art DM artifacts do not allow searching directly for descriptive and predictive
models by specifying desired utility-related parameters.
One typical approach employed in practice is the use of feedback from domain
experts on whether they find the DM artifact and what it outputs useful (insightful,
applicable, actionable, transparent, understandable, etc.). The feedback is used to adjust
the data preprocessing steps or the parameters of a data modeling technique (or the
selection of a particular technique) until an acceptable solution satisfying the major
expectations of the experts is found [29].
One way or another, there is a need to study what factors affect user acceptance
of DM artifacts in general and of particular learned models. Pazzani [29] states
that studying how people assimilate new knowledge could help the DM community to
design better KDD systems; in particular, instead of generating alternatives and testing
them against utility-related criteria, KDD systems would bias the search toward models
that meet these criteria. An interesting work in this direction is [3], where the authors
were trying to answer the question of what makes a discovered group difference
insightful, approaching two concrete research questions, "Is a discriminative or
characteristic approach more useful for describing group differences?" and "How do
subjective and objective measures of rule interest relate to each other?", by conducting a
user (domain expert) study. Unfortunately, such studies constitute only a small minority
of research efforts in DM-related areas.

In the Information Systems (IS) discipline, IS success research has been practiced
for quite a long time; for example, the most widely known DeLone and McLean Success
Model [11] was based on a review of 180 earlier studies. It has since served as a reference
model for many additional studies, and DeLone and McLean in their ten-year survey [12]
found more than one hundred research reports explicitly referencing the success model.
The enhanced model is included in Figure 3.
Bokhari [5] used meta-analysis, collecting a set of 55 papers from major journals,
conference proceedings, books and dissertations within the period 1979 to 2000, and tried
to explain and find empirical evidence about the relationship between system usage and
user satisfaction. Both have long been considered part of the success metric, but the
research related to their relationship has failed to reach an agreement about its strength
and nature [5]. Of course there are also articles discussing success factors of certain kinds
of systems [34][23].

Figure 3. Adapted from the D&M IS Success Model [11] (p.87) and the updated D&M IS Success Model [12] (p.24)

Maybe the first efforts to consider success factors of DM systems are the ones
presented in DM Review magazine [9][19], which include practice-based success-factor
considerations. Coppock [9] analyzed the failure factors of DM-involved projects. He
names four: (1) persons in charge of the project did not formulate actionable insights, (2)
the sponsors of the work did not communicate the derived insights to key constituents, (3)
the results did not agree with institutional truths, and (4) the project never had a sponsor
and champion. The main conclusion of Coppock's analysis is that, as in IS, leadership,
communication skills and an understanding of the culture of the organization are no less
important than the traditionally emphasized quality of data and technological skills to turn
data into insights.
Hermiz [19] communicated his belief that there are four critical success factors
for DM projects: (1) having a clearly articulated business problem that needs to be solved
and for which DM is a proper tool, (2) ensuring that the problem being pursued is
supported by the right type of data of sufficient quality and in sufficient quantity for DM,
(3) recognizing that DM is a process with many components and dependencies – the entire
project cannot be "managed" in the traditional business sense of the word, and (4)
planning to learn from the DM process regardless of the outcome, and clearly
understanding that there is no guarantee that any given DM project will be successful.
Lin in [40] notices that, in fact, no major impacts of DM on the business world have
been echoed. However, even the reporting of existing success stories is important.
Giraud-Carrier in [17] reported a summary of 136 success stories of DM, covering 9
business areas with references to 30 DM tools or DM vendors. Unfortunately, no deep
analysis was provided to summarize or discover the main success factors.


Hevner et al. in [20] presented a conceptual framework for understanding, conducting
and evaluating IS research. We adapt their framework to the context of DM
research (see Figure 4). The framework combines the behavioral-science and
design-science paradigms and shows how research rigor and research relevance can be
explained, evaluated, and balanced.
We follow [20] in the description of the figure, emphasizing issues important in
DM. The environment defines not only the data that represents the problem to be mined
but also people, (business) organizations, and their existing or desired technologies,
infrastructures, and development capabilities. These include the (business) goals, tasks,
problems, and opportunities that define (business) needs, which are assessed and evaluated
within the context of organizational strategies, structure, culture, and existing business
processes. The research activities that are aimed at addressing business needs contribute
to the relevance of research.

Figure 4. New research framework for DM research (adapted from [20])

In a recent report [13], following the lively discussions about rigor and relevance
aspects going on in the IS society during 2003 and 2004, the participants (the authors of
each article) answered a few questions related to information systems research. One of
them was the definition of "significant research" in the IS area. All participants strongly
valued the influence that IS has outside the IS research society. Among the answers, some
even wanted to take a very broad view, including solving societal problems and creating
wealth among the significant research, while some others were more concerned with issues
related to the closer stakeholders around the IS field. In DM research we still have a lot to
do to "take good care of our own backyard", as El Sawy [13] (p.343) expressed his opinion
about the key characteristics of the IS research that really matters. In the next section we
continue broadening the utility considerations to the different groups of users, which have
many different utility preferences.

3. Generic framework for utility considerations of DM research

To consider the different utility aspects of DM research, we first consider the possible
understandings of the group of stakeholders of DM research. By a stakeholder of DM
research we understand a person or an organization that has a legitimate interest in DM
research or its results. We divide the stakeholders of DM research into two main groups:
(1) internal stakeholders, that is, stakeholders within academia, and (2) external
stakeholders, that is, all the others outside academia (as is usual in the IS discipline and
as suggested for DM research in [32]).
Related to IS research, some authors stress the need to recognize its stakeholders
[21] (p.249). They mention the following as external stakeholders of publicly-funded IS
research: industry shareholders and their agents (management), the employees of firms
and organizations, their agents (unions), community and other levels of government, and
the general public. Besides external stakeholders, they refer to [4] in noting that IS
researchers have important stakeholders within academia, such as funding agencies,
colleagues in other disciplines, university administrators, and students.
When the panelists were discussing IS research that really matters, they were also
asked their opinion about IS research stakeholders [13]. Several traditional aspects came
up, such as the academic community (professional peers, students, journals which regulate
and disseminate publication of research, as well as academic research funding
institutions), the business community (managers and professionals, including consultants,
who use IS to manage, as well as those who design, build, and manage IS), and the
non-profit organization community. There were also opinions about a broader
interpretation of stakeholders, including all those who are affected by IS (all human beings
and even the entire human race around the world, now and in the future), and concerns
about freedom to select, run, and publish IS research.
The main publicly funded DM research has concentrated on the development of
new algorithms or their enhancements and has left it to the DM developers in domain areas
to take into account, for example, the cost considerations: investment in research, product
development, marketing, and product support (Lin in [40]). However, we have raised the
questions, "Is it reasonable that DM researchers leave the study of the DM development
and DM use processes totally outside their research area?" and "Are these equally
important aspects going to be handled better by the researchers of other areas, so that DM
researchers should also in the future concentrate on the technological aspects of DM
only?" [32]. In any case, it is evident that DM research has both external and internal
stakeholders, as IS does, and DM researchers themselves need to decide which of the
stakeholders and their utilities they will consider in the future.
After recognizing the stakeholders, it is necessary to consider what relevancy means
for them. This is required in order to be able to consider their utilities with respect to
research. In IS research, the focus of [21] is on the most commonly espoused group,
industry management, considering two subgroups: senior management and the
practitioners in IS departments. The internal stakeholders also comprise two groups: the
IS research community and academics in other disciplines. In discussions concerning
internal stakeholders, the relevance for academics from other departments is considered to
be at least as important as the relevance for external stakeholders. This is because the
other academics, participating in academic communities and funding organizations, are
able to control the advancement of IS researchers and their field as a whole.
Of course, the internal stakeholders of DM research (the DM research community and
academics of other disciplines) are also very important for DM research. For researchers
in the same area, the rigor aspect of research is dominant and thus it is the main
criterion of research relevance for them, too. Nowadays the widespread utilization of
IT within diverse industries (manufacturing, health care, education, etc.) has also raised
the interest of other academics in the results of IT-related research, and of DM research,
too. This means that, as DM researchers, we encounter a growing pull to conduct rigorous
research that at the same time produces useful results for researchers of other scientific
areas.
Cresswell in [10] (p.2) writes that "One rather critical distinction is between relevance
to and serves the interests of or is value to" when he considers the relevance of
IS research. Some others [21] have kept this distinction as a starting point in exploring
the meaning of relevance because, for example, some research which does not directly
consider a stakeholder's interests might still not be irrelevant. It is further noticed that
different stakeholder groups tend to possess conflicting interests arising from their
different value systems. Thus IS research relevancy depends on judgments that should be
made explicit. Hirschheim and Klein in [21] (pp.250-253) have recognized that both the
business community and the academic community have not managed to justify their
expectations about IS research. They argue that the IS research community has done a
very poor job of communicating, and add that if IS researchers truly believe that their
theories are relevant for practitioners, they should communicate their results better. On the
other hand, "the view of IS held by IS-practitioners is at best only partially supported by
some theories that guide IS research" (ibid. 253) and the view of non-IS practitioners is
still "even more at odds".
In his introduction to the JAIS special section on IS research perspectives, former
senior editor Straub recently made a reference in [36] to [1], where the authors classified
knowledge sources into three broad dissemination types: (1) academic, (2) practitioner
or professional, and (3) academic-practitioner. He further refers to [2] and [37], which
have argued that academics differ from practitioners in their focus on conceptual clarity,
concluding "that managers who value definitional clarity will need to seek out academic
venues since little of this information [i.e. clear definitions of concepts] will be found
elsewhere" [36] (p.242).
We consider DM an inseparable, essential part of the knowledge discovery process,
and think that a more holistic view is needed in DM research. If this is accepted,
DM researchers have to take more and more utility-related topics under investigation, on
larger scales. Simple assessment measures like predictive accuracy have to give way to
economic utility measures, such as profitability and return on investment, beside the
narrower economic ones as in cost-sensitive learning. In the following we concentrate on
only the most important stakeholders and suggest the use of a generic framework for
utility considerations. This framework is based on the multi-attribute additive function
[24] represented below:


V(a_i, w) = \sum_{j=1}^{m} w_j \, v_j(a_i), \qquad \text{where } \sum_{j=1}^{m} w_j = 1 \text{ and } w_j \ge 0.     (1)

Here v_j(a_i) is the value of an alternative DM project (later alternative, for short)
a_i for criterion v_j, and w_j is the weight of that criterion when the global value V(a_i, w)
of the alternative a_i is calculated. The weights indirectly reflect the importance of the criteria.
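
A direct computational reading of Eq. (1) might look as follows; the alternatives,
criterion values and weights in this sketch are purely hypothetical.

```python
# Minimal sketch of the additive multi-attribute value function of Eq. (1):
# V(a_i, w) = sum_j w_j * v_j(a_i), with non-negative weights summing to one.
# The projects, criterion values and weights below are hypothetical.

def global_value(criterion_values, weights):
    """criterion_values and weights are parallel sequences over the m criteria."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    return sum(w * v for w, v in zip(weights, criterion_values))

# Two alternative DM projects scored on three criteria (on a 0..1 scale).
project_a = [0.8, 0.4, 0.6]
project_b = [0.5, 0.9, 0.7]
weights   = [0.5, 0.3, 0.2]     # importance of the criteria

print("V(a) =", global_value(project_a, weights))   # 0.64
print("V(b) =", global_value(project_b, weights))   # 0.66
```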
Evaluating alternative DM projects (alternatives a_i) as investments can be based on
different investment appraisal techniques, which attach values v_j(a_i) to the alternatives
taking different criteria into account. One traditional investment appraisal technique is
Parker et al.'s [28] information economics, considering two domains: business and
technology. The business domain includes the following factors: (1) Return on investment,
(2) Strategic Match, (3) Competitive Advantage, and (4) Organizational risk. The
technology domain includes the factors: (1) Strategic architecture alignment,
(2) Definitional uncertainty risk, (3) Technical uncertainty, and (4) Technology
infrastructure risk.
These factors have been regrouped in [35] into two main criteria: value and risk. In
this new structure, the value criterion contains the business factors except Organizational
risk, plus one technology domain factor, Strategic architecture alignment. The risk
criterion includes the other four factors. The authors have further extracted 27 detailed
criteria from IT/IS management literature and demonstrated one possible setting of these
detailed criteria under the main ones. They have further changed the multi-attribute
additive function, following a multi-criteria utility theory approach where the values of
the alternatives (v_j(a_i) in the above formula) are scaled into utilities on an interval from
zero to one.

Figure 5. The generic framework for utility evaluation from a DM stakeholder point of view.
(w_1, w_2, w_11, ..., w_1n, and w_21, ..., w_2n are weights)

The result is a four-level selection model, where the top level (level 1) gives the net
utility of each project (V(a_i, w) in the formula) based on two equally weighted main
criteria at level 2: the value and the risk of the project alternative. Both the value and risk
main criteria are composed of four level-3 sub-criteria (mentioned above), each with their
own weight structure. The detailed criteria are at level 4, and the utility value of each
level-3 sub-criterion is decided taking into account 2-4 of them using different weight
structures [35].

We enhance this simple and flexible tree structure into another structure where
different sub-criteria can affect more than one main criterion, and we allow this approach
to go down to the lower levels too. In our new generic multi-criteria utility based
framework (MCUF) the weights of the two main criteria, w_1 and w_2 (see Figure 5), can
be fixed taking into account the context of decision making and the preferences of the
stakeholder (for example, whether he is risk seeking or risk averse). Also the weights
w_11, ..., w_1n and w_21, ..., w_2n used to calculate the utility values of the main criteria
based on the utility values of the sub-criteria are allowed to be fixed in a context- and
stakeholder-dependent way. The weight of some criterion is allowed to be zero, as appears
in the examples of the next section when some sub-criteria are dropped from the
corresponding figure.
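
A rough sketch of how such a two-level MCUF evaluation could be computed for a
single stakeholder is shown below. The value/risk split follows the text, but all
sub-criterion names, utilities and weights are hypothetical, and a zero weight simply
means that a sub-criterion does not influence that main criterion (as in the examples of
the next section).

```python
# Hypothetical sketch of a two-level MCUF evaluation for one stakeholder.
# Each sub-criterion has a utility in [0, 1] (already oriented so that higher
# is better, also for risk-related sub-criteria) and one weight per main
# criterion; a zero weight means it does not influence that main criterion.

def weighted_utility(sub_utilities, weights):
    """Weighted sum of sub-criterion utilities; weights assumed to sum to 1."""
    return sum(weights[name] * u for name, u in sub_utilities.items())

def net_utility(sub_utilities, value_weights, risk_weights, w_value=0.5, w_risk=0.5):
    value = weighted_utility(sub_utilities, value_weights)
    risk  = weighted_utility(sub_utilities, risk_weights)
    return w_value * value + w_risk * risk

# Invented sub-criteria for a top-management view of one DM project.
sub_utilities = {
    "return_on_investment":  0.7,
    "strategic_match":       0.9,
    "organizational_risk":   0.4,
    "technical_uncertainty": 0.6,
}
value_weights = {"return_on_investment": 0.6, "strategic_match": 0.4,
                 "organizational_risk": 0.0,  "technical_uncertainty": 0.0}
risk_weights  = {"return_on_investment": 0.0, "strategic_match": 0.0,
                 "organizational_risk": 0.5,  "technical_uncertainty": 0.5}

print("net utility:", net_utility(sub_utilities, value_weights, risk_weights))  # 0.64
```

Changing the weight profile (for example, giving risk a higher weight for a risk-averse
stakeholder) immediately changes the ranking of alternatives, which is how the framework
captures context- and stakeholder-dependent preferences.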

4. DM processes revisited: UBDM perspective

In this section we present the DM stakeholders' views on utility considerations
(Figure 6). We also provide three intuitive examples which aim to demonstrate the
different priorities of different stakeholders in utility considerations.

Figure 6. DM Stakeholders view on utility considerations

Figure 6 has its roots in CRISP-DM (remember Figure 1). However, here we emphasize
the importance of the use process, which can start once there exists a ready-to-use DM
artifact (that is, the result of the development, implementation, evaluation, and deployment
of a certain DM solution that addresses a recognized business challenge).
We also emphasize that the use process is connected to certain type(s) of DM
stakeholders (in the middle of Figure 6). They may have different (and potentially
competing) business challenges. Their use of the DM artifact leads to certain individual
and/or organizational impacts (remember the success model from Figure 3), which need to
be evaluated for utility considerations.
In Figures 7 – 9 we show how our generic framework (MCUF) accounts for the
different utility considerations with regard to the different groups of DM stakeholders.
In Figure 7 we demonstrate one hypothetical example of utility consideration by top
management in some organization (i.e. an external customer, who can potentially adopt a
DM artifact for managerial decision making) in a hypothetical situation when she is
deciding about a DM project. Her main criteria are value and risk. The traditional
investment appraisal technique used is information economics, considering two domains:
the business and technology domains [28]. Those typical eight criteria (sub-criteria in
Figure 7) affect the main criteria (i.e. value or risk) with corresponding weights w_ij,
i = 1, 2; j = 1, ..., 8. The zero weights emphasize the fact that some sub-criteria may
impact only one group of main criteria.
In Figure 8 the utility consideration of a domain expert is presented (i.e. another type
of external customer, who can use a DM artifact for decision support in daily
operational decision making, e.g. in diagnostics). We highlight here the sub-criteria that
impact the overall utility of the DM tool from the domain expert's point of view, such as:
satisfaction from use, possible changes in responsibility for the decisions made, whether
the tool is transparent in functionality and the results are easy to interpret, whether training
and support are likely to be needed (and provided), and finally, what the overall impact
of the use of the tool will be. We can see that these utility considerations (in fact,
potentially related to the same DM artifact) differ quite a lot from those of the previous
group of stakeholders.

Figure 7. Example of utility consideration by top management in some organization when deciding on DM
project selection (zero weights emphasize the fact that some sub-criteria may impact only one group of main
criteria, i.e. value or risk)

Figure 8. Example of utility consideration by a domain expert when deciding whether to (continue to) use a
DM artifact (weights omitted from figure).

Figure 9 illustrates a different type of example, when an editor (or a peer reviewer)
of a DM journal (an example of an internal customer) needs to decide whether to accept
a submitted paper or not. Here also a set of sub-criteria can be identified that impact the
major value or risk criteria: how relevant to the scope of the journal the paper is, how
rigorous the methods are, how relevant the results are, what the impact of this paper on the
credibility of the journal and the DM field would be, whether research ethics has not been
violated, and whether the paper contents are comprehensible.

Figure 9. Example of utility consideration by an editor of a DM journal deciding on the acceptance of the
submitted DM paper (weights omitted from figure).

5. Conclusions

Strong expectations have lately been loaded onto DM to help organizations and
individuals get more utility from their databases and data warehouses. These expectations
are based more on the fine, rigorous research results achieved with the technical aspects of
DM methods and algorithms than on a vast amount of practical success stories. The time
when DM research also has to answer the practical expectations is fast approaching. Who
should take care of researching users' (both individuals' and organizations') goals and
success factors when they install and use DM (system) parts in their ensembles of IS
functionalities? Can this research be left to IS researchers, or do the researchers of these
topics also need some amount of DM knowledge?
The goal of this paper is to bring these broader UBDM questions under discussion
among researchers and practitioners in the DM area. We first made a short review of the
DM research that has taken utility into account. The review by no means covers all the
papers published, but shows the main lines worked on. Then we took a closer look at the
practice-oriented normative advice included in the CRISP-DM model to motivate the
use and user orientation more broadly in the utility considerations. We briefly discussed
a well-known success model and the design science approach applied in the IS discipline.
We considered more closely the stakeholders of DM research and especially their utilities.
We suggested a sketch of a new generic multi-criteria utility based framework (MCUF)
for more detailed analysis of different stakeholders' utilities in their contexts. Some
illustrative examples of the use of the framework in hypothetical contexts were also given.
The framework still needs to be studied in various real application contexts in order to
validate it.
In our future work we plan to focus on a meta-analysis of DM research, tracing
its development, and to produce categorisations based on the theory/practice orientation of
the examined DM research, the use of different kinds of research methods, and other
criteria. We plan to estimate approximately the proportions of published work in different
directions and different types of DM research through a literature review and paper
categorisation according to predefined classification criteria. Besides the analysis of the
relevant literature from the top international data mining related journals and international
conference proceedings, we plan to collect and analyse the editorial policies of these
top international journals and conferences. This will result in a better understanding of the
major findings and trends in the DM research area. We expect that this will also help us
to highlight the existing imbalance in the area, and to suggest ways of improving the
situation.

Acknowledgements

This research was partly supported by the Academy of Finland. We would like to thank
the reviewers for their constructive and detailed comments.


References
[1] N. J. Adler and S. Bartholomew. Academic and professional communities of discourse: Generating
knowledge on transnational human resource management. Journal of International Business Studies,
23(3):551–569, 1992.
[2] R. P. Bagozzi, Y. Yi, and L. W. Phillips. Assessing construct validity in organizational research. Admin-
istrative Science Quarterly, 36(3):421–458, 1991.
[3] S. D. Bay and M. J. Pazzani. Discovering and describing category differences: What makes a discovered
difference insightful. In Proc. of the 22nd Annual Meeting of the Cognitive Science Society, 2000.
[4] A. Bhattecherjee. Understanding and evaluating relevance in IS research. Communications of the Asso-
ciation for Information Systems, 6(6), 2001.
[5] R. Bokhari. The relationship between system usage and user satisfaction: a meta-analysis. The Journal
of Enterprise Information Management, 18(2):211–234, 2005.
[6] A. P. Bradley. The use of the area under the roc curve in the evaluation of machine learning algorithms.
Pattern Recognition, 30(7):1145–1159, 1997.
[7] D. Carvalho, A. Freitas, and N. Ebecken. Evaluating the correlation between objective rule interesting-
ness measures and real human interest. In A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama,
editors, Proc. of the 16th European Conf. on machine learning and the 9th European Conf. on principles
and practice of knowledge discovery in databases ECML/PKDD-2005, pages 453–461. Springer, 2005.
[8] P. Chapman, J. Clinton, R. Kerber, T. Khabaza, T. Reinartz, C. Shearer, and R. Wirth. CRISP-DM 1.0
Step-by-step data mining guide. The CRISP-DM consortium, 2000.
[9] D. S. Coppock. Data mining and modeling: So you have a model, now what? DM Review Magazine,
2003.
[10] A. Cresswell. Thoughts on relevance of is research. Communications of the Association for Information
Systems, 6(9), 2001.
[11] W. DeLone and E. McLean. Information systems success: The quest for the dependent variable. Infor-
mation Systems Research, 3(1):60–95, 1992.
[12] W. DeLone and E. McLean. The delone and mclean model of information systems success: A ten-year
update. Journal of MIS, 19(4):9–30, 2003.
[13] K. Desouza, O. El Sawy, R. Galliers, C. Loebbecke, and R. Watson. Beyond rigor and relevance towards
responsibility and reverberation: Information systems research that really matters. Communications of
the Association for Information Systems, 16(16):341–353, 2006.
[14] P. Domingos. Metacost: a general method for making classifiers cost-sensitive. In Proc. of the 5th ACM
SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, 1999.
[15] W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan. Adacost: misclassification cost-sensitive boosting. In
Proc. 16th International Conf. on Machine Learning, pages 97–105. Morgan Kaufmann, 1999.
[16] U. Fayyad. Data mining and knowledge discovery: Making sense out of data. IEEE Expert, 11(5):20–25,
1996.
[17] C. Giraud-Carrier. Success Stories in Data/Text Mining. Brigham Young University, 2004.
[18] R. Greiner. Budgeted learning of probabilistic classifiers. In Proc. of the Workshop on Utility-Based
Data Mining (UBDM’06), Invited Talk, 2006.
[19] K. Hermiz. Critical success factors for data mining projects. DM Review Magazine, 1999.
[20] A. R. Hevner, S. T. March, J. Park, and S. Ram. Design science in information systems research. MIS
Quarterly, 28(1):75–105, 2004.
[21] R. Hirschheim and H. Klein. Crisis in the IS field? A critical reflection on the state of the discipline.
Journal of the Association for Information Systems, 4(10):237–293, 2003.
[22] R. C. Holte and C. Drummond. Cost-sensitive classifier evaluation. In Proc. of the 1st Int. Workshop on
Utility-Based Data Mining, UBDM ’05, pages 3–9. ACM Press, 2005.
[23] M. Kamal. IT innovation adoption in the government sector: Identifying the critical success factors. The
Journal of Enterprise Information Management, 19(2):192–222, 2006.
[24] R. Keeney and H. Raiffa. Decisions with Multiple Objectives: Preferences and Value Tradeoffs. Wiley:
New York, 1976.
[25] D. Luo, L. Cao, C. Luo, C. Zhang, and W. Wang. Towards business interestingness in actionable knowl-
edge discovery. In C. Soares, Y. Peng, J. Meng, T. Washio, and Z.-H. Zhou, editors, Applications of
Data Mining in E-Business and Finance, pages 99–109. IOS Press, 2008.
[26] G. Melli, O. R. Zaïane, and B. Kitts. Introduction to the special issue on successful real-world data
mining applications. SIGKDD Explorations, 8(1):1–2, 2006.


[27] M. Ohsaki, H. Abe, S. Tsumoto, H. Yokoi, and T. Yamaguchi. Evaluation of rule interestingness mea-
sures in medical knowledge discovery in databases. Artificial Intelligence in Medicine, 41(3):177–196,
2007.
[28] M. Parker, R. Benson, and H. Trainor. Information Economics: Linking Business Performance to Information
Technology. Prentice-Hall, Englewood Cliffs, NJ, 1988.
[29] M. Pazzani. Knowledge discovery from data? IEEE Intelligent Systems, 15(2):10–13, 2000.
[30] M. Pechenizkiy, S. Puuronen, and A. Tsymbal. Why data mining does not contribute to business? In
C. S. et al, editor, Proc. of Data Mining for Business Workshop, DMBiz (ECML/PKDD’05), pages 67–
71, 2005.
[31] F. Provost and T. Fawcett. Analysis and visualization of classifier performance: Comparison under
imprecise class and cost distributions. In Proc. of the 3rd Int. Conf. on Knowledge Discovery and Data
Mining, 1997.
[32] S. Puuronen, M. Pechenizkiy, and A. Tsymbal. Keynote paper: Data mining researcher, who is your cus-
tomer? some issues inspired by the information systems field. In Proc. of the 17th Int. Conf. on Database
and Expert Systems Applications DEXA’06, pages 579–583. IEEE Computer Society, Washington, DC,
2006.
[33] A. Silberschatz and A. Tuzhilin. What makes patterns interesting in knowledge discovery systems. IEEE
Transactions on Knowledge and Data Engineering, 8(6):970–974, 1996.
[34] P. Soja. Success factors in ERP systems implementations: Lessons from practice. The Journal of Enterprise
Information Management, 19(6):646–661, 2006.
[35] R. Steward and S. Mohammed. It/is projects selection using multi-criteria utility theory. Logistics
Information Management, 15(4):254–270, 2002.
[36] D. Straub. The value of scientometric studies: An introduction to a debate on IS as a reference discipline.
Journal of AIS, 7(5):241–246, 2006.
[37] L. Van Dyne, L. Cummings, and J. McLean Parks. Extra role behaviours: In pursuit of construct and
definitional clarity (a bridge over muffled waters). Research in Organizational Behavior, 17:215–285,
1995.
[38] S. Viaene and G. Dedene. Cost-sensitive learning and decision making revisited. European Journal of
Operational Research, 166:212–220, 2004.
[39] G. Weiss, M. Saar-Tsechansky, and B. Zadrozny. UBDM ’05: Proc. of the 1st Int. workshop on Utility-
based data mining. 2005.
[40] X. Wu, P. S. Yu, G. Piatetsky-Shapiro, N. Cercone, T. Y. Lin, R. Kotagiri, and B. W. Wah. Data mining:
How research meets practical development. Knowledge and Information Systems, 5(2):248–261, 2000.
[41] H. Yao and H. J. Hamilton. Mining itemset utilities from transaction databases. Data and Knowledge
Engineering, 59(3):603–626, 2006.
[42] B. Zadrozny, G. Weiss, and M. Saar-Tsechansky. Proc. of the 2nd Int. Workshop on Utility-based data
mining, UBDM ’06. 2006.
Copyright © 2010. IOS Press, Incorporated. All rights reserved.

Data Mining for Business Applications, edited by C. Soares, and R. Ghani, IOS Press, Incorporated, 2010. ProQuest Ebook Central,
http://ebookcentral.proquest.com/lib/tromsoub-ebooks/detail.action?docID=647889.
Created from tromsoub-ebooks on 2024-08-11 18:34:23.
66 Data Mining for Business Applications
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-66

Customer Validation of Commercial Predictive Models

Tilmann BRUCKHAUS 1 and William E. GUTHRIE

Numetrics Management Systems, Inc.

Abstract. A central need in the emerging business of model-based prediction is to enable customers to validate the accuracy of a predictive product. This paper discusses how analysts can evaluate data mining models and their inferences from the customer viewpoint, where the customer is not particularly knowledgeable in data mining. To date, academia has focused primarily on the validation of algorithms through mathematical metrics and benchmarking studies. This type of validation is not sufficient in the business context, where organizations must validate specific models in terms that customers can understand quickly and effortlessly. We describe our predictive business and our customer validation needs. To that end, we discuss examples of customer needs, review issues associated with model validation, and point out how academic research may help to address these business needs.

Introduction

Fielded applications of data mining usually require careful validation before deployment. In the case of commercial applications, validation is particularly important, and when predictive modeling technology is the foundation of an entire company, the need for validation becomes a question of survival. In this emerging business of predictive products and services, a new class of research problems, motivated by real-world business needs, is materializing in the guise of validation by the customer. Once an organization has established the business relevance of a predictive model, customer validation of model accuracy is arguably the most critical challenge in selling data mining-based products and services to customers. With this premise, we share observations and lessons learned from practical experiences with a business entirely focused on predictive modeling.
Numetrics Management Systems serves the semiconductor industry with products
and services based on predictive data mining technology to help our customers evaluate,
plan, control and succeed with semiconductor and system design projects. Our
products capture critical design parameters of finished projects, such as electrical and
transistor properties, and use data mining technology to predict key performance
indicators for new projects. These key performance indicators include cost, time-to-
market and productivity, and they cover related issues, such as design complexity,
effort, staffing and milestones. Numetrics is a pioneer in the emerging market for
predictive software and services, and we began assembling our industry database in

1 Corresponding author: Numetrics Management Systems, Inc., 20863 Stevens Creek Blvd., Suite 510, Cupertino, CA 95014; E-mail: [email protected]
1996. Since then, we have accumulated a rich history of experiences with creating
predictive products, as well as with selling and supporting them. Customers must be
confident that our applications give accurate results to rely on them for business-critical
decisions. Validating the predictions is therefore an essential step to acceptance.
Customer validation is similar to the traditional mathematical validation of data
mining algorithms and predictive models. However, in many ways, customer
validation comprises a superset of the difficulties and challenges of mathematical
validation. In our experience with applying data mining technology to real industry
data and actual business problems, data mining currently focuses predominantly on a
small fraction of the entire problem. Kohavi & Provost [1] capture our own assessment
of the situation well when they state:

“It should also be kept in mind that there is more to data mining than just building an
automated [...] system. […]. With the exception of the data-mining algorithm, in the current
state of the practice the rest of the knowledge discovery process is manual. Indeed, the
algorithmic phase is such a small part of the process because decades of research have focused
on automating it – on creating effective, efficient data mining algorithms. However, when it
comes to improving the efficiency of the knowledge discovery process as a whole, additional
research on efficient data mining algorithms will have diminishing returns if the rest of the
process remains difficult and manual. […] In sum, […] there still is much research needed –
mostly in areas of the knowledge discovery process other than the algorithmic phase.”

In this paper, we explore the specific research needs that relate to customer validation of predictive models.

1. Background

Data mining experts and customers of data mining technology do not necessarily
share the same training and background. Data miners typically have thorough
knowledge of data mining as well as statistical training. For example, some of the
more widely read overview texts on data mining are Berry & Linoff [2], Han &
Kamber [3], Mitchell [4], Quinlan [5], Soukup & Davidson [6], and Witten & Frank [7].
Two insightful papers, which compare various methods of evaluating machine-learning algorithms on a broad set of benchmarking data sets, provide a good introduction into the more specialized field of model validation [8] and [9]. There is also a rich field of research into cost-sensitive learning; for example, see [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], and [24].
For customers who are familiar with such literature, model validation techniques
that academia uses may be wholly appropriate. However, customers with expertise in
data mining are an exception rather than the rule. Most of our customers are engineers
and managers who do not necessarily have the background and skills one may take for
granted within the data mining community. Customers will therefore ask questions that
reflect the terminology and approaches of their own field of expertise, and data miners
may be hard-pressed to provide answers.

2. Customer approach to model validation

The customer view of model validation is different from the academic view
because one can make few assumptions as to the statistical and data mining savvy of
the customer. Customers must focus on their business needs, and in our case, these
needs are those of the semiconductor business. Our customers view our product not so
much as a predictive model, but rather as a tool that can answer questions about cost,
time-to-market, and productivity of a semiconductor design project. Our customers do
not concern themselves primarily with the predictive model that is the engine inside the
application. Instead, they focus on the controls of the application, and in order to
receive value from the product, they need the application to address their business
needs directly and immediately.
When our customers ask, “how accurate is your model?” they have a broad and
diverse mental picture of what “accuracy” means. Many of our customers are
engineers, so they expect a meticulous response. The following table lists some of the
questions our customers are likely to ask when they validate a predictive model:

Table 1: Questions focused on the Application Domain

1. Our organization has collected effort metrics on 50 completed projects. How accurately does the application rank the expected effort for those projects?
2. I have experience with version N of the model. With version N+1 available, should I migrate to version N+1? Is it better? How much better is it?
3. How accurate is the application, and how is accuracy measured?
4. I have experienced a challenging situation where the team doubled the target frequency of a chip in order to address market needs. How does the doubling of frequency affect the predicted effort for the design?
5. My organization operates in the automotive semiconductor field, which is a highly specialized market. Can the application predict accurately in this specific environment?
6. Can the application predict exactly how long my radio frequency design project will take?
7. I know from experience that the number of capacitors on an analog design relates to effort. How can the application predict effort accurately when we cannot enter the number of capacitors into the application?
8. I do not track Ring Oscillator Delay, but the application requires this input. Will the application still be useful without this input, and how sensitive is the application to inaccurately entered data?
9. The application asks for clock speed but my design is pure analog and has no clocks. What should I enter? Is this model useful to me?

It is apparent that our customers’ questions are mostly specific to the domain of
semiconductor design. Even when our customers do not ask their questions explicitly
in the terminology of the semiconductor business, they would still like to obtain
answers in semiconductor design-project terminology. For example, when our
customers ask: “How accurate is the application, and how is accuracy measured?” they
would prefer an answer that uses their terminology, like “X% of projects complete
within Y% of the predicted completion date” to an answer which does not use their
terminology, like “The F-Score is X.” For comparison, the next table lists questions
that focus on data mining technology:


Table 2: Questions focused on Data Mining Technology

10. What is the area under the Receiver-Operating-Characteristic Curve?
11. What is the optimal number of boosting operations?
12. What is the Lift for this model?
13. What is the F-Score?
14. What is the Cross Entropy of the model?
15. How well would the application perform on the Iris Data Set [25]?
16. How imbalanced was the training data set?
17. Where is the precision/recall break-even point?
18. Does the application use a Support Vector Machine?

These two sets of questions illustrate that customers think in terms of their field of
application rather than in terms of data mining, and more importantly, it is often not
clear how to translate one language into the other.
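To make the translation concrete, a domain-level accuracy statement of the kind quoted earlier ("X% of projects complete within Y% of the predicted completion date") can be computed directly from predicted and actual outcomes. The short Python sketch below only illustrates that calculation; the project durations and the 20% tolerance are invented and are not Numetrics figures.

import numpy as np

def share_within_tolerance(actual, predicted, tol=0.20):
    """Fraction of projects whose prediction is within +/- tol (relative error) of the actual value."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rel_error = np.abs(actual - predicted) / actual
    return float(np.mean(rel_error <= tol))

# Hypothetical completion times (in weeks) for five finished projects.
actual = [40, 55, 32, 70, 48]
predicted = [38, 60, 30, 90, 50]
pct = 100 * share_within_tolerance(actual, predicted)
print(f"{pct:.0f}% of projects completed within 20% of the predicted duration")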
In addition to a desire for familiar terminology, there are other peculiarities about
customer validation. Some customers are more interested in “white box” validation
whereas others may be more interested in “black box” validation, where white box
validation considers how the model operates, and black box validation only considers
the behavior displayed by the application. Both types of customers would like to
receive answers to their questions. A further complication, however, is that customers
who are interested in white box validation may want to understand the model in terms
of engineering equations they are familiar with, and they might want to see a formula
which describes how a specific input of interest affects the predictive output of the
model.
A related question arises in the context of planning: “how accurate is the
application’s estimate of effort for a specific new project that will be critical to the
customer’s future success?” This question cannot be answered based on a single project
(one observation), because during the project’s planning stage it is obviously
impossible to compare its predicted effort to its actual effort. Moreover, we must
measure model accuracy for a population of cases and although the model may be very
accurate across an entire population, it may provide less accurate results for a single
case. Generally, it is not clear whether and how it might be possible to obtain accuracy
estimates for specific cases. Such individual cases, or use cases, may be pivotal to the
customer's business and to their decision of adopting a predictive application. Another key caveat is that semiconductor design projects evolve throughout their life cycle.
What customers want to estimate at the planning stage may change during the project,
and it would be useful to account for this evolution in modeling and validation.

3. Analysis

The issue of customer-oriented evaluation has two components. The first is how to
place evaluations in the customer's language, which we will refer to as the “domain
language requirement”. The second component is perhaps more relevant to data
mining research: how to make sure that evaluations actually answer the wide range of
questions that customers will ask. We will refer to this requirement as the “evaluation
completeness requirement”.
One avenue to address the domain language requirement might be to use a generic
language that can be mapped more-or-less easily into the language for a particular

domain. How then do the different questions in Table 1 correspond to more general questions in such a generic language? For example, Question-1 is a question
of applying the model to particular cases that the customer has in mind. Question-5
talks about the accuracy of the model on a particular subspace of the domain.
Question-8 talks about how to deal with missing values during model use, as opposed
to model construction. Question-9 talks about the inputs actually "used" by the model.
Table 3 provides such a mapping from customer language to machine learning
language for all sample questions listed in Table 1. In the left-hand column of Table 3,
we list the customer's domain-specific concepts related to the questions from Table 1. We
then match the fragments to approximately comparable ideas and approaches in
machine-learning language in the middle column. We also suggest some key machine-
learning concepts in the right-hand column, which appear to relate to the customer
concern in question.

Table 3: Analysis of Customer Validation Needs

Customer's domain-specific concepts: Estimations based on first-hand experience and intuition (Table 1, items 1, 4, 7, 9).
Analysis: Reference to potential use cases or background expert knowledge. To achieve customer acceptance and win their business, it may be particularly important that the model perform well on these use cases. It may be possible to improve model performance by incorporating appropriate background expert knowledge or by capturing additional training cases.
Key machine-learning concepts: Use Case, Background Expert Knowledge, Training Cases, Validation Cases.

Customer's domain-specific concepts: Knowledge of relative actual outcomes (item 1).
Analysis: Statistical analysis of ranking, such as Spearman's rank correlation coefficient, may be a good tool for evaluating model performance.
Key machine-learning concepts: Rank Correlation.

Customer's domain-specific concepts: Concern over risk associated with improvement vs. stability (item 2).
Analysis: Compare specific alternative models in terms of their performance and quality.
Key machine-learning concepts: Model Comparison.

Customer's domain-specific concepts: Business risk due to potentially inaccurate estimations (items 3, 5, 6).
Analysis: References to model quality; some are more generic, while others are more specific and identify project duration as a target variable.
Key machine-learning concepts: Model Quality, Target Variables.

Customer's domain-specific concepts: Intuitions about expected model behavior in response to changes in input values (items 4, 7, 8).
Analysis: Consider use cases where a specific input changes, such as "frequency", and review the impact in terms of the sensitivity of the model.
Key machine-learning concepts: Model Sensitivity, Specific Inputs.

Customer's domain-specific concepts: Awareness that unique cases within the domain require special treatment (items 5, 6, 7).
Analysis: There are reportedly clusters of cases in the input space where the model should perform differently from how it performs in other ranges. It may be helpful to use unsupervised learning to discover such clusters and to offer cluster membership as an input. Alternatively, one may build different models or sub-models to address different sub-domains.
Key machine-learning concepts: Case Clustering, Sub-Models, Stratification.

Customer's domain-specific concepts: Insights about which parameters should be used for estimation (items 4, 7).
Analysis: A variable considered important by the customer is not an input to the model. Are there one or more "proxy" variables in the model which account for some of the "missing" information? Is there an opportunity to build a better model with additional inputs?
Key machine-learning concepts: Missing Variables, Proxy Variables, Adding Inputs.

Customer's domain-specific concepts: Required data cannot be collected or does not apply (items 8, 9).
Analysis: Estimation based on partial inputs; dealing with missing values; inapplicable inputs.
Key machine-learning concepts: Missing Data in Scoring Records.
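As an illustration of the rank-correlation suggestion in Table 3 (for question 1, ranking the expected effort of completed projects), the following sketch computes Spearman's rank correlation between actual and application-predicted effort. The effort figures are invented and scipy is assumed to be available; this is not part of any Numetrics product.

from scipy.stats import spearmanr

# Hypothetical actual and predicted effort (person-months) for completed projects.
actual_effort = [120, 45, 300, 80, 150, 60, 210]
predicted_effort = [100, 65, 280, 60, 170, 55, 230]

rho, p_value = spearmanr(actual_effort, predicted_effort)
print(f"Spearman rank correlation between actual and predicted effort: {rho:.2f} (p = {p_value:.3f})")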

Certainly, our list of customer needs and questions and our mapping into the
machine-learning domain is not exhaustive. For example, some topics which we have
not addressed but which are equally important to customer validation of data mining
technology are the explanatory power of data mining models [26], the financial impact
of predictive models [27], and information retrieval-related customer needs [28].
Understanding customer concerns is a prerequisite for validating practical,
commercial data mining applications. In some cases, it may be best to address customer
validation needs by analyzing the output of a model, while in other cases it may be
possible to address customer validation needs directly inside of the data-mining
algorithm. For example, it may be possible to build predictive models while
specifically taking into account the sensitivity of the model to variations of the inputs,
as suggested by [29], and [30]. As we understand customer validation needs better,
researchers and practitioners will be able to address the evaluation completeness
requirement better. One method may be to select algorithms and model evaluation procedures that address specific customer validation needs by design.
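As one concrete example of the output-side option, the sensitivity concern behind question 4 (doubling the target frequency) can be probed by perturbing a single input of a trained model and reporting the change in its prediction. The sketch below uses an arbitrary scikit-learn regressor on invented data; it is not the sensitivity-aware training of [29] or [30], and the feature names are hypothetical.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Invented training data: columns are [frequency_MHz, transistor_count, num_blocks].
X = rng.uniform([100, 1e6, 5], [1000, 5e7, 50], size=(200, 3))
y = 0.5 * X[:, 0] + 2e-5 * X[:, 1] + 3.0 * X[:, 2] + rng.normal(0, 10, 200)  # effort proxy

model = RandomForestRegressor(random_state=0).fit(X, y)

# Black-box sensitivity probe: double the target frequency of a planned project.
project = np.array([[400.0, 2e7, 20.0]])
doubled = project.copy()
doubled[0, 0] *= 2

base, perturbed = model.predict(project)[0], model.predict(doubled)[0]
print(f"Predicted effort changes from {base:.0f} to {perturbed:.0f} "
      f"({100 * (perturbed - base) / base:+.1f}%) when frequency doubles")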

4. Conclusions

In this paper, we have reviewed customer requirements for the evaluation of data mining models. Common themes in customer model validation include:
• Sensitivity: the model's response to changes in input values, sign and magnitude
• Range: the specific range where the model's inputs are valid
• Parameters: which of the thousand possible factors does the model incorporate directly, and which are covered by proxies, and
• Robustness: how much of the real-world domain space is covered by the model.
There are more, but these points need systematic explanation as to how to apply
tests and how to interpret results.
Defining accuracy in mathematical terms is very simple but capturing the various
needs and ideas that customers connect to accuracy is much more difficult. There are
machine-learning techniques available to address customer validation needs but a
comprehensive framework is lacking to date. Certainly one complication is the need to
get the evaluation into a form that customers can use. Customers will have to work
through various issues to validate new models, and they need support from the vendor.
In fact, what the vendor has to do to build and validate a model is exactly what a
customer has to do to evaluate the resulting model. It would be attractive to use a
common tool set for model building and to support customer validation. It seems
reasonable to invest in formalizing this process and sharing it with customers.
Although it would require analysis across domains, it would clearly be interesting to
understand how often these different sorts of evaluation questions arise.

It is generally not practical to train customers in data mining validation, and what
we need instead is technology for supporting customer validation in practical terms. It
appears that customers are interested in model-level accuracy, the effect of specific
inputs on model output, as well as in a great variety of domain-specific use cases. The
customer view of model validation is at once very similar and very different from the
data miner’s view, and it is our hope that technologies will evolve that will make it
easy to cross the chasm between the two.

References

[1] Kohavi, R., and Provost, F., January 2001, “Applications of Data Mining to E-commerce” (editorial),
Applications of Data Mining to Electronic Commerce. Special issue of the International Journal Data
Mining and Knowledge Discovery.
[2] Berry, M.J.A., and Linoff, G. 1997. Data Mining Techniques: For Marketing, Sales, and Customer
Support. John Wiley & Sons.
[3] Han, J and Kamber, M. 2005. Data Mining, Second Edition: Concepts and Techniques (The Morgan
Kaufmann Series in Data Management Systems)
[4] Mitchell, T. 1997. Machine Learning. McGraw-Hill Science / Engineering / Math; first edition.
[5] Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
[6] Soukup, T. and Davidson, I. 2002. Visual Data Mining: Techniques and Tools for Data Visualization
and Mining. Wiley.
[7] Witten, I.H., and Frank, E. 2005. Data Mining: Practical machine learning tools and techniques.
Morgan Kaufmann, San Francisco. Second Edition.
[8] Caruana, R., and Niculescu-Mizil, A. 2004, Data Mining in Metric Space: An Empirical Analysis of
Supervised Learning Performance Criteria. In Proceedings of the Tenth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004.
[9] Caruana, R., and Niculescu-Mizil, A., 2006, An Empirical Comparison of Supervised Learning
Algorithms. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA.
[10] Bruckhaus, T., Ling, C.X., Madhavji, N.H., and Sheng, S. 2004. Software Escalation Prediction with
Data Mining. Workshop on Predictive Software Models (PSM 2004), A STEP Software Technology &
Engineering Practice.
[11] Chawla, N.V., Japkowicz, N., and Kolcz, A. eds. 2004. Special Issue on Learning from Imbalanced
Datasets. SIGKDD, 6(1): ACM Press.
[12] Domingos, P. 1999. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings
of the Fifth International Conference on Knowledge Discovery and Data Mining, 155-164, ACM Press.
[13] Drummond, C., and Holte, R.C. 2003. C4.5, Class Imbalance, and Cost Sensitivity: Why under-sampling beats over-sampling. Workshop on Learning from Imbalanced Datasets II.
[14] Elkan, C. 2001. The Foundations of Cost-sensitive Learning. In Proceedings of the International Joint
Conference of Artificial Intelligence (IJCAI 2001), 973-978.
[15] Fan, W., Stolfo, S.J., Zhang, J., and Chan, P.K. 1999. AdaCost: Misclassification Cost-sensitive
Boosting. In Proceedings of the Sixteenth International Conference on Machine Learning, 97-105.
[16] Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R. (editors). 1996. Advances in
Knowledge Discovery and Data Mining, AAAI/MIT Press.
[17] Japkowicz. N. 2001. Concept-Learning in the Presence of Between-Class and Within-Class Imbalances,
In Proceedings of the Fourteenth Conference of the Canadian Society for Computational Studies of
Intelligence (AI'2001).
[18] Joshi, M.V., Agarwal, R.C., and Kumar, V. 2001. Mining needles in a haystack: classifying rare classes
via two-phase rule induction. In Proceedings of the SIGMOD’01 Conference on Management of Data.
[19] Ling, C.X., and Li, C. 1998. Data Mining for Direct Marketing: Specific Problems and Solutions. In
Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-
98), 73-79.
[20] Ling, C.X., Yang, Q., Wang, J., and Zhang, S. 2004. Decision trees with minimal costs. In Proceedings
of International Conference on Machine Learning (ICML).
[21] Niculescu-Mizil, A., and Caruana, R. 2001. Obtaining Calibrated Probabilities from Boosting. AI Stats.
[22] Ting, K.M. 2002. An Instance-Weighting Method to Induce Cost-sensitive Trees. IEEE Transactions on
Knowledge and Data Engineering, 14(3):659-665.

[23] Weiss, G., and Provost, F. 2003. Learning when Training Data are Costly: The Effect of Class
Distribution on Tree Induction. Journal of Artificial Intelligence Research 19: 315-354.
[24] Zadrozny, B., Langford, J., and Abe, N. 2003. Cost-sensitive Learning by Cost-Proportionate Example
Weighting. In Proceedings of International Conference of Data Mining (ICDM).
[25] Anderson, E. 1935, "The irises of the Gaspé peninsula", Bulletin of the American Iris Society 59, 2-5.
[26] Pazzani, M. J. (2000), "Knowledge Discovery from Data?" IEEE Intelligent Systems, March/April 2000, 10-13.
[27] Bruckhaus, T. 2007. The Business Impact of Predictive Analytics. Book chapter in Knowledge
Discovery and Data Mining: Challenges and Realities with Real World Data. Zhu, Q, and Davidson, I.,
editors. Idea Group Publishing, Hershey, PA
[28] Joachims, T., 2002. Optimizing search engines using clickthrough data. In Proceedings of the eighth
ACM SIGKDD international conference on Knowledge discovery and data mining, Edmonton, Alberta,
Canada, pp 133 - 142
[29] Engelbrecht, A. P., 2001, “Sensitivity Analysis for Selective Learning by Feedforward Neural
Networks", Fundamenta Informaticae, 45(4), pp 295-328.
[30] Castillo, E., Guijarro-Berdiñas, B., Fontenla-Romero, O., Alonso-Betanzos, A., 2006, A Very Fast
Learning Method for Neural Networks Based on Sensitivity Analysis, Journal of Machine Learning
Research, 7(Jul), pp 1159-1182.
Part 2
Data Mining Applications of Today
Data Mining for Business Applications 77
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-77

Customer churn prediction - a case study in retail banking

Teemu MUTANEN a,1, Sami NOUSIAINEN a and Jussi AHOLA a
a VTT Technical Research Center of Finland

Abstract. This work focuses on one of the central topics in customer relationship
management (CRM): transfer of valuable customers to a competitor. Customer re-
tention rate has a strong impact on customer lifetime value, and understanding the
true value of a possible customer churn will help the company in its customer rela-
tionship management. Customer value analysis along with customer churn predic-
tions will help marketing programs target more specific groups of customers. We
predict customer churn with logistic regression techniques and analyze the churn-
ing and nonchurning customers by using data from a consumer retail banking com-
pany. The results of the case study show that using conventional statistical methods
to identify possible churners can be successful.

Keywords. Churn, Customer churn, retail banking

Introduction

This paper presents a customer churn analysis in the consumer retail banking sector. The focus of customer churn analysis is to determine the customers who are at risk of leaving and, if possible, to analyze whether those customers are worth retaining. A company will therefore have a sense of how much is really being lost because of customer churn and of the scale of the efforts that would be appropriate for a retention campaign.
Customer churn is closely related to the customer retention rate and loyalty. Hwang et al. [8] describe customer defection as the hottest issue in the highly competitive wireless telecom industry. Their LTV model suggests that the churn rate of a customer has a strong impact on the LTV because it affects the length of service and the future revenue. Hwang et al. also define customer loyalty as an index of how much customers would like to stay with the company. Churn describes the number or percentage of regular customers who abandon their relationship with the service provider [8].

Customer loyalty = 1 - Churn rate

Modeling customer churn from a purely parametric perspective is not appropriate in the LTV context because the retention function tends to be spiky and non-smooth, with spikes at the contract ending dates [14]. Usually, from the marketing perspective, the sufficient information about churn is the probability of possible churn. This enables the mar-
1 Corresponding Author, e-mail: teemu.mutanen@vtt.fi

Table 1. Examples of churn prediction in literature (article; market sector; case data; methods used).

Au et al. [1]; wireless telecom; 100 000 subscribers; DMEL method (data mining by evolutionary learning)
Buckinx et al. [2]; retail business; 158 884 customers; logistic regression, ARD (automatic relevance determination), decision tree
Ferreira et al. [6]; wireless telecom; 100 000 subscribers; neural network, decision tree, HNFS, rule evolver
Garland [7]; retail banking; 1 100 customers; multiple regression
Hwang et al. [8]; wireless telecom; 16 384 customers; logistic regression, neural network, decision tree
Mozer et al. [12]; wireless telecom; 46 744 subscribers; logistic regression, neural network, decision tree
Keaveney et al. [9]; online service; 28 217 records; descriptive statistics based on the questionnaires sent to the customers

keting department so that, given the limited resources, the high-probability churners can be contacted first [1].
Lester explains the segmentation approach in customer churn analysis [11]. She also points out the importance of studying the right characteristics in the customer churn analysis. For example, in the banking context those signals might include a decreasing account balance or a decreasing number of credit card purchases. A similar type of descriptive analysis has been conducted by Keaveney et al. [9], who studied customer switching behavior in online services based on questionnaires sent out to the customers. Garland has done research on customer profitability in personal retail banking [7]. Although his main focus is on the customers' value to the study bank, he also investigates the duration and age of the customer relationship based on profitability. His study is based on a customer survey by mail, which helped him to determine the customers' share of wallet, satisfaction and loyalty from qualitative factors.
Table 1 presents examples of the churn prediction studies found in the literature: analyses of churning customers have been conducted in various fields. However, based on our best understanding, no practical studies focused on the difference between continuers and churners have been published for the retail banking sector.

1. Case study

The consumer retail banking sector is characterized by customers who stay with a company for a very long time. Customers usually give their financial business to one company and do not switch their financial services provider very often. From the company's perspective this produces a stable environment for customer relationship management. Despite the continuous relationships with customers, the potential loss of revenue because of customer churn can be huge. The mass marketing approach cannot succeed in the diversity of consumer business today. Customer value analysis along with customer churn predictions will help marketing programs target more specific groups of customers.
In this study a customer database from a Finnish bank was used and analyzed. The
data covered only personal customers. The data at hand was collected from the time

Figure 1. Customers with and without a current account and their average in/out money, presented for the different channels. The legend shows the number of customers that have transactions in each channel. (A test sample was used, n=50 000.)

period December 2002 till September 2005. The sampling interval was three months, so for this study we had relevant data for 12 points in time [t(0)-t(11)]. In the logistic regression analysis we used a sample of 151 000 customers.
In total, 75 variables were collected from the customer database. These variables are related to the following topics: (1) account transactions IN, (2) account transactions OUT, (3) service indicators, (4) personal profile information, and (5) customer-level combined information. Transactions have volumes in both in and out channels; out channels also have frequency variables for the different channels.
The data had 30 service indicators in total (e.g., a 0/1 indicator for whether the customer has a housing loan or not). One of these indicators, C1, described the current account. Figure 1 shows the average money volumes in the different channels for two groups of customers in sample 1, where the customers are divided into groups based on the current account indicator.
As mentioned previously, customers' value to a company is at the heart of all customer management strategy. In the retail banking sector, revenue is generated both from the margins of lending and investment activities and from service/transaction/credit card/etc. fees. As Garland noted [7], retail banking is characterized by many customers (compared to wholesale banking with its few customers), many of whom make relatively small transactions. This setup makes it hard to define customer churn in the retail banking sector based on customer profitability.
One of the indicators mentioned above, C1, tells whether the customer has a current account in the time period at hand or not, and the definition of churn in the case study is
based on it. This simple definition is adequate for the study and makes it easy to detect the exact moment of churn. Customers without a C1 indicator before the time period were not included in the analysis; their volume in the dataset is small. In the banking sector a customer who does leave may leave an active customer id behind, because bank record formats are dictated by legislative requirements. The definition of churn presented above produced a relatively small number of customers to be considered churners: on average, fewer than 0.5% of the customers in each time step were considered churners.
This problem has been identified in the literature under the term class imbalance problem [10], and it occurs when one class is represented by a large number of examples while the other is represented by only a few. The problem is particularly crucial in an application, such as the present one, where the goal is to maximize recognition of the minority class [4]. In this study a down-sizing method was used to avoid all predictions turning out as nonchurners. The down-sizing (under-sampling) method consists of randomly removing samples from the majority class population until the minority class becomes some specific percentage of the majority class [3]. We used this procedure to produce two different datasets for each time step: one with a churner/nonchurner ratio of 1/1 and the other with a ratio of 2/3.
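A minimal sketch of such a down-sizing step, assuming the data is held in a pandas DataFrame with a binary churn column, is shown below; the column name, the ratio handling and the helper function are ours, not the authors' code.

import pandas as pd

def downsize(df, target="churn", ratio=1.0, seed=42):
    """Randomly under-sample nonchurners so churners form the requested churner/nonchurner ratio."""
    churners = df[df[target] == 1]
    nonchurners = df[df[target] == 0]
    n_keep = int(len(churners) / ratio)          # ratio=1.0 -> 1/1, ratio=2/3 -> 2/3
    kept = nonchurners.sample(n=min(n_keep, len(nonchurners)), random_state=seed)
    return pd.concat([churners, kept]).sample(frac=1, random_state=seed)

# Hypothetical usage, one pair of training sets per time step:
# train_1_1 = downsize(data_t4, ratio=1.0)      # churner/nonchurner ratio 1/1
# train_2_3 = downsize(data_t4, ratio=2 / 3)    # churner/nonchurner ratio 2/3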
In this study we use binary predictions: churn and no churn. A logistic regression method [5] was used to formulate the predictions. The logistic regression model generates a value between 0 and 1 based on the estimated model. The predictive performance of the models was evaluated by using lift curves and by counting the number of correct predictions.
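The following sketch illustrates this modelling step with scikit-learn's logistic regression on invented customer data; the feature names and the synthetic churn mechanism are assumptions for illustration only, not the variables or data of the case study.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000

# Invented customer data in which churners tend to show a declining ATM-transaction trend.
df = pd.DataFrame({
    "customer_age": rng.integers(18, 90, n),
    "customer_bank_age": rng.integers(0, 40, n),
    "n_atm_prev": rng.poisson(10, n),
})
drop = rng.integers(0, 6, n)
df["n_atm_now"] = rng.poisson((df["n_atm_prev"] - drop).clip(lower=0).to_numpy() + 1)
decline = (df["n_atm_prev"] - df["n_atm_now"]).clip(0, 5)
df["churn"] = (rng.random(n) < 0.05 + 0.03 * decline).astype(int)

features = ["customer_age", "customer_bank_age", "n_atm_prev", "n_atm_now"]
train, valid = df.iloc[: n // 2], df.iloc[n // 2 :]

model = LogisticRegression(max_iter=1000).fit(train[features], train["churn"])

# The fitted model yields a churn probability between 0 and 1 for each customer;
# a 0.5 threshold turns it into a binary churn / no-churn prediction.
p_churn = model.predict_proba(valid[features])[:, 1]
pred = (p_churn >= 0.5).astype(int)
print("share of correct predictions:", (pred == valid["churn"]).mean())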

2. Results

A collection of six different regression models was estimated and validated. The models were estimated by using six different training sets: three time periods (t = 4, 6, and 8) with two datasets each. This produced six regression models, which were validated by using data sample 3 (115 000 customers with the current account indicator). In the models we used several independent variables; the variables for each model are presented in Table 2. The number of correct predictions for each model is presented in Table 3. In the validation we used the same sample with the churners before the time period t=9 removed, and the data for validation was collected from time periods t(9)-t(11).
Although all the variables in each of the models presented in Table 2 were significant, there could still be correlation between the variables. For example, in this study the variables Num. of transactions (ATM) are correlated to some degree because they represent the same variable, only from different time periods. The problem that arises when two or more variables are correlated with each other is known as multicollinearity. Multicollinearity does not change the estimates of the coefficients, only their reliability, so the interpretation of the coefficients will be quite difficult [13]. One of the indicators of multicollinearity is high standard error values combined with low significance statistics. A number of formal tests for multicollinearity have been proposed over the years, but none has found widespread acceptance [13].
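One common informal screen is to inspect pairwise correlations and variance inflation factors, VIF_j = 1/(1 - R²_j), where R²_j is obtained by regressing variable j on the remaining variables. The sketch below applies this to invented lagged transaction counts; it is an illustration, not one of the formal tests referred to above.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n_atm_prev = rng.poisson(10, 1000)
# The same count one quarter later: strongly related to its lagged value.
n_atm_now = rng.poisson(np.maximum(n_atm_prev + rng.integers(-3, 4, 1000), 1))
X = pd.DataFrame({"n_atm_prev": n_atm_prev, "n_atm_now": n_atm_now})

print(X.corr())  # high off-diagonal correlations hint at multicollinearity

for col in X.columns:
    others = X.drop(columns=col)
    r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
    print(col, "VIF =", round(1.0 / (1.0 - r2), 1))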
It can be seen from Table 2 that all the variables have very little variance between the models. The only larger difference is in the value of the variable Number of services in the
Table 2. Predictive variables that were used in each of the logistic regression models. The notation X1 denotes a training dataset with a churner/nonchurner ratio of 1/1 and X2 a dataset with a ratio of 2/3. The coefficients of the variables in each of the models are presented in the table.

Model                                        41      42      61      62      81      82
Constant                                     -       -       0.663   -       0.417   -
Customer age                                 0.023   0.012   0.008   0.015   0.015   0.013
Customer bank age                            -0.018  -0.013  -0.017  -0.014  -0.013  -0.014
Vol. of (phone) payments in t=i-1            -       -       -       -       0.000   0.000
Num. of transactions (ATM) in t=i-1          0.037   0.054   -       -       0.053   0.062
Num. of transactions (ATM) in t=i            -0.059  -0.071  -       -       -0.069  -0.085
Num. of transactions (card payments) t=i-1   0.011   0.013   -       0.016   0.020   0.021
Num. of transactions (card payments) t=i     -0.014  -0.017  -       -0.017  -0.027  -0.026
Num. of transactions (direct debit) t=i-1    0.296   0.243   0.439   0.395   -       -
Num. of transactions (direct debit) t=i      -0.408  -0.335  -0.352  -0.409  -       -
Num. services, (not current account)         -1.178  -1.197  -1.323  -1.297  -0.393  -0.391
Salary on logarithmic scale in t=i           0.075   0.054   -       -       -       -

Table 3. Number and % share of the correct predictions (mean from the time periods t=9, 10, 11). In the validation sample there were 111 861 cases. The results were produced by the models when the threshold value 0.5 was used.
Model Number of correct % correct % churners in % true churners
predictions predictions the predicted set identified as churners
model 41 69670 62 0.8 75.6
model 42 81361 72 0.9 60.5
model 61 66346 59 0.8 79.5
model 62 72654 65 0.8 73.4
model 81 15384 14 0.5 97.5
model 82 81701 73 0.9 61.3

models 81 and 82 compared to the value in the rest of the models. The overall behavior of the coefficients is that the coefficient half a year before the churn has a positive sign and the coefficient three months before the churn has a negative sign. This indicates that the churning customers are those that have a declining trend in their transaction numbers. Also, based on the coefficient values, a greater customer age and a smaller customer bank age both have positive impacts on the churn probability.
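In a logistic regression, each coefficient can also be read on the odds scale: a coefficient b multiplies the odds of churn by exp(b) for a one-unit increase of the corresponding variable. For example, under this standard interpretation the customer age coefficient 0.023 in model 41 corresponds to exp(0.023) ≈ 1.023, i.e. roughly a 2.3% increase in the odds of churn per additional year of age, while the negative customer bank age coefficient implies decreasing odds of churn with a longer customer relationship.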
The logistic regression model generates a value between 0 and 1 based on the estimated model, as described above. When a threshold value is used to discriminate the customers, both types of classification error will be made: a churning customer could be classified as a nonchurner, and a nonchurning customer could be classified as a potential churner. In Table 3 the number of correct predictions is presented for each model. In the validation, sample 3 was used with the churners before the time period t=9 removed.
The values in Table 3 are calculated using a threshold value of 0.5. If the threshold value were, for example, set to 1 instead of 0.5, the percentage of correct predictions would be 99.5, because all the predictions would be nonchurners and because there were only 481 churners (0.45%) on average in the validation set. The important result in Table 3 is the column % churners in the predicted set, which tells the percentage of true churners in the predicted set when the threshold value 0.5 is used.
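The quantities reported in Table 3 can be reproduced from a set of predictions by counting the four outcomes of the confusion matrix at the chosen threshold. The sketch below uses invented labels and scores purely to show the calculation.

import numpy as np

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1, 0, 0])            # 1 = churner
scores = np.array([0.2, 0.6, 0.7, 0.1, 0.4, 0.3, 0.8, 0.9, 0.2, 0.1])
pred = (scores >= 0.5).astype(int)

tp = int(np.sum((pred == 1) & (y_true == 1)))   # churners predicted as churners
fp = int(np.sum((pred == 1) & (y_true == 0)))   # nonchurners predicted as churners
fn = int(np.sum((pred == 0) & (y_true == 1)))   # churners predicted as nonchurners
tn = int(np.sum((pred == 0) & (y_true == 0)))   # nonchurners predicted as nonchurners

print("% correct predictions:", 100 * (tp + tn) / len(y_true))
print("% churners in the predicted set:", 100 * tp / (tp + fp))
print("% true churners identified as churners:", 100 * tp / (tp + fn))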

[Figure 2 plot: lift curves showing the % of churners identified (vertical axis) against the % of customers identified (horizontal axis), with one curve for each of the models 41, 42, 61, 62, 81 and 82.]

Figure 2. Lift curves from the validation-set (t=9) performance of six logistic regression models. Model num-
ber (4, 6, and 8) represents the time period of the training set and (1 and 2) represent the down-sizing ratio.

The important result found in Table 3 is the proportional share of true churners identified as churners by the model. It can also be seen in the table that the models with a good overall prediction performance do not perform as well in the prediction of churners. The previously discussed class imbalance problem has an impact here.
The lift curve helps to analyze the number of true churners that are discriminated in each subset of customers. In Figure 2 the percentage of identified churners is presented for each logistic regression model. The lift curves were calculated from the validation-set performance. In Table 3 the models 41, 61, and 62 have correct predictions close to 60%, whereas models 42 and 82 have above 70% correct predictions. This difference between the five models vanishes when the number of correct predictions is analyzed in the subsets, as presented in Figure 2.
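A lift curve of the kind shown in Figure 2 can be computed by sorting customers by their predicted churn probability and, for each fraction of the ranked list, counting the share of all true churners it contains. The sketch below is a generic illustration with made-up scores, not the authors' implementation.

import numpy as np

def lift_curve(y_true, scores):
    """Return (% customers contacted, % of all churners captured) along the ranking."""
    order = np.argsort(scores)[::-1]                     # highest churn probability first
    hits = np.cumsum(np.asarray(y_true)[order])
    pct_customers = np.arange(1, len(y_true) + 1) / len(y_true) * 100
    pct_churners = hits / hits[-1] * 100
    return pct_customers, pct_churners

# Example: how many churners are caught in the top 20% of the ranked customers?
y_true = np.array([0, 1, 0, 0, 1, 0, 0, 0, 1, 0])
scores = np.array([0.1, 0.9, 0.2, 0.4, 0.8, 0.3, 0.1, 0.2, 0.35, 0.05])
x, y = lift_curve(y_true, scores)
top20 = y[x <= 20].max()
print(f"top 20% of customers capture {top20:.0f}% of churners")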

3. Conclusions

In this paper a customer churn analysis was presented for the consumer retail banking sector. The different churn prediction models predicted the actual churners relatively well. The findings of this study indicate that, in the case of a logistic regression model, the user should update the model to be able to produce predictions with high accuracy, since the independent variables of the models vary. The customer profiles of the predicted churners were not included in the study.

From a company's perspective it is interesting to know whether the churning customers are worth retaining or not and, from a marketing perspective, what can be done to retain them. Is a three-month prediction horizon enough to make a positive impact so that the customer is retained, or should the prediction be made, for example, six months ahead?
The customer churn analysis in this study might not be interesting if the customers are valued based on customer lifetime value. The churn definition in this study was based on the current account. If the churn definition were instead based on, for example, a loyalty program account or active use of the internet service, then the customers in focus could possibly have a greater lifetime value and thus it would be more important to retain these customers.

References

[1] Au W., Chan C.C., Yao X.: A Novel evolutionary data mining algorithm with applications to churn
prediction. IEEE Trans. on evolutionary comp. 7 (2003) 532–545
[2] Buckinx W., Van den Poel D.: Customer base analysis: partial detection of behaviorally loyal clients in a
non-contractual FMCG retail setting. European Journal of Operational Research 164 (2005) 252–268
[3] Chawla N., Boyer K., Hall L., Kegelmeyer P.: SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002) 321–357
[4] Cohen G., Hilario M., Sax H., Hugonnet S., Geissbuhler A.: Learning from imbalanced data in surveil-
lance of nosocomial infection. Artificial Intelligence in Medicine 37 (2006) 7–18
[5] Cramer J.S.: The Logit Model: An Introduction. Edward Arnold (1991). ISBN 0-304-54111-3
[6] Ferreira J., Vellasco M., Pachecco M., Barbosa C.: Data mining techniques on the evaluation of wireless
churn. ESANN2004 proceedings - European Symposium on Artificial Neural Networks Bruges (2004)
483–488
[7] Garland R.: Investigating indicators of customer profitability in personal retail banking. Proc. of the Third
Annual Hawaii Int. Conf. on Business (2003) 18–21
[8] Hwang H., Jung T., Suh E.: An LTV model and customer segmentation based on customer value: a case
study on the wireless telecommunication industry. Expert Systems with Applications 26 (2004) 181–188
[9] Keaveney S., Parthasarathy M.: Customer Switching Behaviour in Online Services: An Exploratory
Study of the Role of Selected Attitudinal, Behavioral, and Demographic Factors. Journal of the Academy
of Marketing Science 29 (2001) 374–390
[10] Japkowicz N., Stephen S.: The class imbalance problem: A systematic study. Intelligent Data Analysis 6
(2002) 429–449
[11] Lester L.: Read the Signals. Target Marketing 28 (2005) 45–47
[12] Mozer M. C., Wolniewicz R., Grimes D.B., Johnson E., Kaushansky H.: Predicting Subscriber Dissat-
isfaction and Improving Retention in the Wireless Telecommunication Industry. IEEE Transactions on
Neural Networks, (2000)
[13] Pindyck R., Rubinfeld D.: Econometric models and econometric forecasts. Irwin/McGraw-Hill (1998).
ISBN 0-07-118831-2.
[14] Rosset S., Neumann E., Eick U., Vatnik N., Idan Y.: Customer lifetime value modeling and its use for cus-
tomer retention planning. Proceedings of the eighth ACM SIGKDD international conference on Knowl-
edge discovery and data mining. Edmonton, Canada (2002) 332-340

84 Data Mining for Business Applications
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-84

Resource-bounded Outlier Detection using Clustering Methods

Luis TORGO a,b,1 and Carlos SOARES a,c
a LIAAD/INESC Porto LA, Universidade do Porto, Portugal
b Faculdade de Ciências, Universidade do Porto, Portugal
c Faculdade de Economia, Universidade do Porto, Portugal

Abstract.
This paper describes a methodology for the application of hierarchical clustering methods to the task of outlier detection. The methodology is tested on the problem of cleaning Official Statistics data. The goal is to detect erroneous foreign trade transactions in data collected by the Portuguese Institute of Statistics (INE). These transactions are a minority, but still they have an important impact on the statistics produced by the institute. The detection of these rare errors is a manual, time-consuming task. This type of task is usually constrained by a limited amount of available resources. Our proposal addresses this issue by producing a ranking of outlyingness that allows a better management of the available resources by allocating them to the cases which are most different from the others and, thus, have a higher probability of being errors. Our method is based on the output of standard agglomerative hierarchical clustering algorithms, resulting in no significant additional computational costs. Our results show that it enables large savings by selecting a small subset of suspicious transactions for manual inspection, which, nevertheless, includes most of the erroneous transactions. In this study we compare our proposal to a state of the art outlier ranking method (LOF) and show that our method achieves better results on this particular application. The results of our experiments are also competitive with previous results on the same data. Finally, the outcome of our experiments raises important questions concerning the method currently followed at INE for items with a small number of transactions.

Keywords. Outlier detection, outlier ranking, hierarchical clustering, data cleaning

Introduction

This paper addresses the problem of detecting errors in foreign trade data (INTRASTAT)
collected by the Portuguese Institute of Statistics (INE). The objective is to identify the
transactions that are most likely to contain errors. The selected transactions will then be
manually analyzed by specialized staff and corrected if an error really exists. The effort
required for manual analysis ranges from simply checking the form that was submitted
to further contacts with the company that made the transaction to confirm whether the

1 Corresponding Author: Luis Torgo, LIAAD/INESC Porto L.A., Rua de Ceuta, 118, 6., 4050-190 Porto, Portugal; E-mail: [email protected].


values declared are the correct ones. In any case, the process requires the involvement of
expensive human resources and has significant costs to INE.
Selected transactions are usually the ones with relatively high/low values because
these affect the official statistics that are published by INE the most. Therefore, this
can be cast as an outlier detection problem. The goal is to detect as many of the errors
as possible. However, this task is constrained by the existence of a limited amount of
expensive human resources for the manual detection of errors. Additionally, the amount
of human resources available for the task varies. In busier periods, these resources have to
dedicate less time to this analysis while in quieter times they can do it in a more thorough
way. These constraints pose interesting challenges to outlier-detection methods. Many
of the methods for these detection tasks provide yes/no answers. We claim that this type
of answers leads to sub-optimal decisions when it comes to manually inspecting the
signalled cases. In effect, if the resources are limited we may well get more signals than
we can inspect. In this case, an arbitrary decision must be made about which cases are
to be inspected. By providing a rank of outlyingness instead, the resources can be used
on the cases that have a higher probability of error. This problem occurs in many other
applications, namely in fraud detection tasks.
Previous work on this problem has compared outlier detection methods, a decision
tree induction algorithm and a clustering method [1]. The results obtained with the latter
did not achieve the minimum goals that were established by the domain experts, and,
thus, the approach was dropped. Loureiro et al. [2] have investigated more thoroughly the
use of clustering methods to address this problem, achieving a significant boost in terms
of results. Torgo [3] has recently proposed an improvement of the method described
in [2] to obtain degrees of outlyingness. In this work we apply the method proposed by
Torgo [3] to the INE INTRASTAT data and compare it to other alternatives.
Our method uses hierarchical clustering methods to find clusters with few transac-
tions that are expected to contain observations that are significantly different from the
vast majority of the transactions. Rankings of outlyingness are obtained by exploring the
information resulting from agglomerative hierarchical clustering methods.
Our experiments with the INTRASTAT data show that our proposal is competitive
with previous approaches and also with alternative outlier ranking methods.
Section 1 describes the problem being tackled in more detail as well as the results
obtained previously on this application. We then describe our proposal in Section 2.
Section 3 presents the experimental evaluation of our method and discusses the results
we have obtained. In Section 4 we relate our work with others and finally we present the
main conclusions of this paper in Section 5.

1. Background

In this section we describe the general background, including the problem (Section 1.1)
and previous results (Section 1.2), that provide the motivation for this work.

1.1. Foreign Trade Transactions

Transactions made by Portuguese companies with organizations from other EU countries
are declared to the Portuguese Institute of Statistics (INE) using the INTRASTAT form.
Using this form companies provide information about each transaction, namely:


• Item id,
• Weight of the traded goods,
• Total cost,
• Type (import/export),
• Source, indicating whether the form was submitted using the digital or paper ver-
sion of the form,
• Form id,
• Company id,
• Stock number,
• Month,
• Destination or source country, depending on whether the type is export or import,
respectively.
At INE, the data are inserted into a database. Figure 1 presents an excerpt of a report
produced with data concerning import transactions from 1998 of item with id 101, as
indicated by the field labeled “NC”, below the row with the column names (note that, in 1998, the Portuguese currency was the escudo, PTE).

Figure 1. An excerpt of the INTRASTAT database. The data were modified to preserve confidentiality.
Errors often occur in the process of filling forms. For instance, an incorrectly intro-
duced item id will associate a transaction with the wrong item. Another common mis-
take is caused by the use of incorrect units like, for instance, declaring the weight in tons
instead of kilos. Some of these errors have no effect on the final statistics while others
can affect them significantly.
The number of transactions declared monthly is in the order of tens of thousands.
When all of the transactions relative to a month have been entered into the database,
they are manually verified with the aim of detecting and correcting as many errors as
possible. In this search, the experts try to detect unusual values on a few attributes. One
of these attributes is Cost/Weight, which represents the cost per kilo and is calculated
using the values in the Weight and Cost columns. In Figure 1 we can see that the values
for Cost/Weight in the second and last transactions are much lower than in the others.
The corresponding forms were analyzed and it was concluded that the second transaction
is, in fact, wrong, due to the weight being given in grams rather than kilos, while the last
one is correct.
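As an illustration of this derived attribute, a minimal R sketch with hypothetical column names (not INE's actual database schema):

    ## The attribute inspected by the experts is the unit price of each transaction.
    trans$CostWeight <- trans$Cost / trans$Weight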


The goal of this project is to reduce the time spent on this task by automatically
selecting a subset of the transactions that includes almost all the errors that the experts
would detect by looking at all the transactions. According to INE experts, to be mini-
mally acceptable the system should select less than 50% of the transactions containing
at least 90% of the errors. However, as stated earlier, given that human resources are
quite expensive, the smaller the number of transactions, the better. Additionally, the same
people are involved in other tasks in INE and sometimes are not available to evaluate
INTRASTAT transactions. Therefore, the number of transactions that can be manually
analyzed varies over different months.
Finally, we note that computational efficiency is not important because the automatic
system will hardly take longer than half the time the human expert does.

1.2. Previous Results

Different approaches were tried on this problem. Several months worth of transaction
from 1998 and 1999 were used. The data were provided in the form of two files per
month, one with the transactions before being analyzed and corrected by the experts, and
the other obtained after that process. The integration of the information from the two files
proved much harder than could be expected. Some of the problems found were:
• difficulty in determining the primary key of the tables, even with the help of the
experts;
• some transactions existed in one of the files but not in the other;
• incomplete information, sometimes because it was not filled in the forms, others
due to the reporting software (e.g., values below a given threshold were consid-
ered too low and not printed in the report).
Some of the problems were handled by eliminating the corresponding records, while oth-
ers were simply ignored because they were not expected to affect the data significantly.
This meant that, as it is common in data mining projects, most of the time was spent in
data preparation [4].
Four very different methods were applied. Two come from statistics and are univari-
ate techniques: box plot [5] and Fisher’s clustering algorithm [6]. The third one, Knorr
& Ng’s cell-based algorithm [7], is an outlier detection algorithm which, despite being a
multivariate method, was used only on the Cost/Weight attribute. The last is C5.0 [8], a
multivariate technique for the induction of decision trees.
Although C5.0 is not an outlier detection method, it obtained the best results. This
was achieved with an appropriate transformation of the variables and by assigning dif-
ferent costs to different errors. As a result, 92% of the errors were detected by analyzing
just 52% of the transactions. However, taking advantage of the fact that C5.0 can output
the probability of each case being an outlier, the transactions were ordered by this proba-
bility. Based on this ranking of transactions in terms of their probability of being an error,
it was possible to detect 90% of the errors by analyzing the top 40% of the transactions.
The clustering approach based on Fisher’s algorithm was selected because it finds
the optimal partition for a given number of clusters of one variable. It was applied to all
the transactions of an item, described by a single variable, Cost/Weight. The transactions
assigned to a small cluster, that is, a cluster containing significantly fewer points than
the others, were considered outliers. The distance function used was Euclidean and the


number of clusters was k = 6. A small cluster was defined as a cluster with fewer points
than half the average number of points in the k clusters. The method was applied to data
relative to two months and selected 49% of the transactions which included 75% of the
errors, which did not accomplish the goals set by the domain experts.
Further work based on clustering methods was carried out by Loureiro et al. [2],
who have proposed a new outlier detection method based on the outcome of agglomer-
ative hierarchical clustering methods. Again, this approach used the size of the resulting
clusters as indicators of the presence of outliers. The basic assumption was that outlier
observations, being observations with unusual values, would be distant (in terms of the
metric used for clustering) from the “normal” and more frequent observations, and there-
fore would be isolated in smaller clusters. In [2], several settings concerning the clus-
tering process were explored and experimentally evaluated on the INTRASTAT prob-
lem. The best setup met the requirements of human experts (inspecting less than 50%
of transactions enabled finding more than 90% of the errors), by detecting 94.1% of the
errors by inspecting 32.7% of the transactions. In spite of this excellent result, the main
drawback of this approach is the fact that it does not allow a control over the amount of
inspection effort we have available. For instance, if 32.7% is still too much for the hu-
man resources currently available we face the un-guided task of deciding which of these
transactions will be inspected. The work presented on this paper tries to overcome this
practical limitation.

2. Hierarchical Clustering for Outlier Ranking

As discussed above, outlier-detection problems with constraints on the amount of re-
sources that limit the maximum number of selected cases can better be handled by pro-
viding a ranking of the examples in terms of their expected level of outlierness. The use
of rankings allows the users to select the number of transactions to inspect according to
the available human resources, with a guarantee that the results for that working point
are “optimal”, at least according to the outlier-ranking method.
Clustering algorithms can be used to identify outliers as a side effect of the cluster-
ing process (e.g. [9]). Most clustering methods rely on a distance metric and thus can
be seen as distance-based approaches to outlier detection [7]. However, iterative meth-
ods like hierarchical clustering algorithms (e.g. [10]) can also handle different density
regions, which is one of the main drawbacks of distance-based approaches. In effect, if
we take agglomerative hierarchical clustering methods, for instance, they proceed in an
iterative fashion by merging two of the current groups (which initially are formed by sin-
gle observations) based on some criterion that is related to their proximity. This decision
is taken locally, that is for each pair of groups, and takes into account the density of these
two groups only. This merging process results in a tree-based structure usually known as
a dendrogram. The merging step is guided by the information contained in the distance
matrix of all available data. Several methods can be used to select the two groups to be
merged at each stage. Contrary to other clustering approaches, hierarchical methods do
not require a cluster initialization process that would inevitably spread the outliers across
many different clusters thus probably leading to a rather unstable approach. Based on
these observations we have explored hierarchical clustering methods for detecting both
local and global outliers [2].


Table 1. Outlier ranking for the example of Figure 2.


Rank CaseID OFH
1 1 0.9091
2 12 0.6818
3 17 0.5909
4 18 0.5909
5 19 0.5455

In this paper we present an approach that takes advantage of the dendrogram gen-
erated by hierarchical clustering methods to produce a ranking of outlyingness. This ap-
proach was first described in [3] and is also based on agglomerative clustering methods.
Informally, the idea behind our proposal is to use the height (in the dendrogram) at which
any observation is merged into a group of observations as an indicator of its outlying-
ness. If an observation is really an outlier this should only occur at later stages of the
merging process, that is the observation should be merged at a higher level than “normal”
observations. More formally, we set the outlyingness factor of any observation as,

OF_H(x) = \frac{h}{N} \qquad (1)

where h is the level of the hierarchy H at which the case is merged (counting from the bottom up), and N is the number
of training cases (which is also the maximum level of the hierarchy by definition of the
hierarchical clustering process).
One of the main advantages of our proposal is that we can use a standard hierarchi-
cal clustering algorithm to obtain the OFH values without any additional computational
cost. This means our proposal has a time complexity of O(N 2 ) and a space complex-
ity of O(N ) [11]. We use the hclust() function of the statistical software environment
R [12], which is based on Fortran code by F. Murtagh [13]. This function includes in its
output a matrix (called merge) that can be used to easily obtain the necessary values for
calculating directly the value of OFH according to Equation 1.
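For illustration, a minimal R sketch of Equation 1 on top of hclust() (illustrative only, not the exact code used in our experiments; in hclust()'s merge matrix a negative entry −j means that observation j is merged, still as a singleton, at that step):

    of.h <- function(data, dist.method = "euclidean", agg.method = "average") {
      N  <- nrow(data)
      hc <- hclust(dist(data, method = dist.method), method = agg.method)
      of <- numeric(N)
      for (step in 1:(N - 1)) {
        ## observations still appearing as singletons (negative entries) are merged at this step
        singles <- -hc$merge[step, hc$merge[step, ] < 0]
        of[singles] <- step / N          # Equation 1: OF_H(x) = h / N
      }
      of
    }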
Figure 2.(a) shows an artificial data set with two marked clusters of observations
with very different density. As it can be observed there are two clear outliers: observa-
tions 1 and 12. While the former can be seen as a global outlier, the latter is clearly a
local outlier. In effect, it is only regarded as an outlier because of the high density of its
neighbors: it is in effect nearer to observation 2 than, say, the 14th is to the 15th. How-
ever, as these two latter are in a less compact region their distance is not regarded as a
signal of outlyingness. This is a clear example of a data set with both global and local
outliers and we would like our method to clearly signal both 1 and 12 as observations
with a high probability of being outliers.
Figure 2.(b) shows the dendrogram obtained by using an agglomerative hierarchical
clustering algorithm. As it can be seen, both 1 and 12 are the last observations to be
individually merged into some cluster. As such, it does not come as a surprise that when
running our method on this data we get the top 5 outliers shown on Table 1.
In spite of this success, this method has serious problems when facing compact
groups of outliers. In effect, if we have a data set where there are a few outliers that

Figure 2. An artificial example. [Figure: (a) scatter plot of the data set (axes x and y), with two groups of very different density and the outliers 1 and 12; (b) the corresponding dendrogram (Cases vs. Height).]

are very similar to each other, they will be merged with each other very quickly (i.e.,
at a low level of the hierarchy) and thus will have a very low OFH value despite being
outliers. Figure 3 illustrates the problem. For this data set, the method ranks observations
9 and 10, which are clear outliers, as the least probable outliers (they are in effect the
first to be merged). This problem is particularly important in our application and also in
fraud detection. In both cases, it is often true that the interesting observations are not
completely isolated from all the others. They sometimes stem from a behavior which,
although rare, is systematic (e.g., a company always declares transactions in counts rather
than in kilos).
Figure 3. An artificial example that is problematic for our initial proposal. [Figure: (left) scatter plot of the data set (axes x and y), including the compact group of outliers 9 and 10; (right) the corresponding dendrogram (Cases vs. Height).]

The example of Figure 3 shows a clear failure of our initial proposal. The failure
results from considering only the height at which individual observations are merged and
not groups of observations. When there is a small group of similar observations that is
quite different from others, such that it could make sense to talk about a set of outliers,
they will only be merged with other groups at later stages but they will merge with each
other very early in the process. Therefore, our proposal will not consider this as a sign of
outlyingness of the members of that group. Still, the general idea of our proposal remains
valid so we need to generalize it for these situations. We can do this by assigning a value
similar to that of Equation 1 to all members of the smallest group of any merge that


Table 2. Outlier ranking for the example of Figure 3 using our new proposal.
Rank CaseID OFH
1 9 0.8100
2 10 0.8100
3 11 0.8075
4 15 0.6300
5 16 0.6300

occurs along the hierarchical clustering process. However, we should reinforce this value
with some size-dependent factor (i.e., the smaller the group, the more probable that its
elements are outliers). Formally, for each merge of a group g_s with a group g_l, where
|g_s| < |g_l|, we set the outlier factor of the members of g_s as,

OF(g_s) = \begin{cases} 0 & \text{if } |g_s| > t \\ \left(1 - \frac{|g_s|}{N}\right) \times \frac{h}{N} & \text{if } |g_s| < t \end{cases} \qquad (2)

where |g_s| is the cardinality of the smallest group g_s, t is a threshold that indicates the
number of observations above which a group can not be regarded as a set of outliers for
the data set, and h is the level of the hierarchy where the merge occurs. The OF value
of the larger group g_l is set to zero. The value of OF ranges from zero to one, and it is
maximum when a single observation is merged at the last level of the hierarchy.
Any observation can belong to several groups along its upwards path through the
dendrogram. As such, it will probably get several of these scores at different levels. We
set the outlyingness factor of any observation as the maximum OF score it got along
its path through the dendrogram. By proceeding this way we are in effect enabling the
method to detect local outliers, which at some merging stage might have got a very high
score of OF because they are clear outliers with respect to some group that they have
merged with, even though at higher levels of the hierarchy (i.e., seen more globally),
they might not get such high OF values. This means that the outlyingness factor of an
observation is given by

OF_H(x) = \max_{g \in G_x} OF(g) \qquad (3)

where G_x is the set of groups in the dendrogram to which x belongs.


Applying this method to the problematic example of Figure 3, we get the outlier
ranking shown in Table 2. This is the expected result for this problem, which indicates
that this new formulation is able to handle compact and small groups of outlier observa-
tions, like for instance observations 9 and 10 of this problem.
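A minimal R sketch of this generalized score (again illustrative, not the exact implementation used in our experiments; the threshold t is an argument, and ties between equally sized groups are broken arbitrarily):

    of.h2 <- function(data, t, dist.method = "euclidean", agg.method = "average") {
      N  <- nrow(data)
      hc <- hclust(dist(data, method = dist.method), method = agg.method)
      of <- numeric(N)
      members <- vector("list", N - 1)     # members[[i]]: observations in the group formed at merge i
      for (h in 1:(N - 1)) {
        grp <- lapply(hc$merge[h, ],
                      function(g) if (g < 0) -g else members[[g]])
        members[[h]] <- c(grp[[1]], grp[[2]])
        gs <- grp[[which.min(sapply(grp, length))]]    # the smaller of the two merged groups
        if (length(gs) < t) {
          score  <- (1 - length(gs) / N) * (h / N)     # Equation 2
          of[gs] <- pmax(of[gs], score)                # Equation 3: keep the maximum along the path
        }
      }
      of
    }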

3. Experimental Evaluation

This section describes a series of experiments designed with the goal of checking the
performance of our method on the INTRASTAT data set. We have compared our OFH


method with our previous approach [2] and also with the state of the art in terms of
obtaining degrees of outlyingness: the LOF method [14].
The INTRASTAT data set has some particularities that lead to an experimental
methodology that incorporates some of the experts’ domain knowledge so that the
methodology better meets their requirements.
We start by describing the measures used to assess the quality of the results (Sec-
tion 3.1), then we discuss the experimental setup (Section 3.2), the algorithms that were
tested (Section 3.3) and finally we discuss the results (Section 3.4).

3.1. Evaluation Measures

In order to evaluate the validity of the resulting methodology we have taken advantage
of the fact that the data set given to us had some information concerning erroneous trans-
actions. In effect, all transactions that were inspected by the experts and were found to
contain errors, were labeled as such. Taking advantage of this information we were able
to evaluate the performance of our methodology in tagging these erroneous transactions
for manual inspection. The experts were particularly interested in two measures of per-
formance: Recall and Percentage of Selected Transactions, which are discussed next.
Recall (%R) can be informally defined in the context of this domain as the propor-
tion of erroneous transactions (as labeled by the experts) that are selected by our models
for manual inspection. Ideally, our models should select a set of transactions for manual
inspection that included all the transactions that were previously labeled by the experts
as errors. However, taking into consideration the difficulty of the problem, INE experts
established the value of 90% as the minimum acceptable recall.
Regarding the percentage of selected transactions (%S) this is the proportion of
all the transactions that are selected for manual inspection by the models. This statistic
quantifies the savings in human resources achieved by using the methodology: the lower
this value the more manual effort is saved. INE experts defined 50% as the maximum
admissible value for this statistic. Given the fact that our method outputs a ranking of
outlyingness we can easily control the value of this measure. The user can decide which
percentage of transactions he/she wants to check and then use the ranking provided by
our method to select the transactions corresponding to the selected percentage. Given that
50% is the maximum value and that it is important to release human resources for other
tasks and that the available resources vary, in our experimental evaluation, we have col-
lected results for four different percentages of selected transactions: 35%, 40%, 45% and
50%. All of these settings satisfy the requirements established by the experts concerning
this measure.
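To make these two measures concrete, a small hedged R sketch (ignoring, for now, the special treatment of infrequent items discussed in Section 3.2): given the outlyingness scores and the expert labels of one month, select the top fraction and compute %S and %R.

    eval.selection <- function(scores, is.error, PercS) {
      n.sel <- ceiling(PercS * length(scores))
      sel   <- order(scores, decreasing = TRUE)[seq_len(n.sel)]
      c(S = n.sel / length(scores),                 # percentage of selected transactions
        R = sum(is.error[sel]) / sum(is.error))     # recall over the expert-labeled errors
    }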
An important issue that must be taken into account when analyzing the value of
recall (%R) is the quality of the labels assigned to transactions. When a transaction
is labeled as an error, this classification is reliable because it means that the experts
have analyzed the transaction and found an error. However, since not all transactions are
analyzed, there may be some that are labeled as “normal" but are, in fact, errors. Many
of these are transactions that were actually detected by the experts but, because they are
not expected to affect the trade statistics which are computed based on these data, are not
corrected. However, it is possible that some significant errors are missed by the experts.
Here, we will not address this issue and simply focus on selecting the errors that were
detected by the domain experts.


Table 3. The “base” results obtained just by including the items with fewer than 10 transactions.

     Jan/1998  Feb/1998  Mar/1998  May/1998  Jun/1998  Aug/1998  Sep/1998  Oct/1998
%S   35.7      30.8      27.7      24.5      32        21.0      17.0      22.5
%R   35.4      40.4      38.7      29.7      37        30.8      25.4      27.9

3.2. Experimental Setup

According to INE experts, the items should be inspected separately due to the rather
diverse distribution of the prices of the products. For instance, the variation of values for
rice is smaller than for heavy machinery. As such we have applied our algorithm to the
set of transactions of each item in turn.
Our outlier ranking method is designed for multivariate analysis. However, follow-
ing another suggestion from the domain experts we have focused our study of the IN-
TRASTAT data set on a single variable, Cost/Weight. Domain experts give particular
attention to this variable as they believe it is the most efficient variable for detecting the
important errors.
Given that INE processes the data on a monthly basis we have decided to use this
very same logic in our tests. This methodology will also enable us to compare our results
with the results obtained in [1], where the same strategy was followed.
One final constraint has an important effect on the results. According to INE ex-
perts, all items with very few transactions, referred to as infrequent items, must be set
for manual inspection. This reduces the number of transactions that the outlier detection
methods may, in fact, select. The domain experts defined 10 as the threshold: items with
fewer than 10 transactions are classified as infrequent. As shown in Table 3,
this fact alone has a big impact on the process. The number of transactions that can be
selected by the outlier detection method is not 50%, as originally established, but ranges
from 15% to 35% (approx.). Furthermore, the concentration of errors in the infrequent
items is generally higher than in the others, but not that much higher. In the selected
transactions, the number of errors found represents between 25% and 40% (approx.) of
all the errors. Considering, for instance, the month of Jan/1998, the items with less than
10 transactions represent 35.7% of all the transactions and contain 35.4% of the errors.
This means that, to achieve the target of 90% Recall, the outlier detection method
needs to find almost 55% of the errors by selecting less than 15% of the transactions, to
stay within the maximum effort tolerated by INE experts, which is 50%.
The experimental methodology that we have used is described in detail in Algorithm 1.
This algorithm calculates the value of the Recall for each month of the testing period,
given a certain desired human effort (given by a provided %S).

3.3. Algorithms

Using Algorithm 1 we have collected the performance, in terms of Recall, of our pro-
posed method and also of the LOF method.
The clustering-based outlier detection method proposed here (Section 2) has several
parameters. The first is the agglomeration method used with the hclust() function. In
our experiments we have tested several alternatives: the ward, single, complete, average,
mcquitty, median and centroid methods. Another parameter of our method is the distance


Algorithm 1. The experimental methodology.

Require: D, PercS        ▷ D is the data set, PercS is the %S selected by the user
Ensure: %R               ▷ The vector of %R's for each month
 1: for all m ∈ Months do
 2:   TotTrans ← |D_m|
 3:   TotErrors ← |{tr ∈ D_m : label(tr) = error}|
 4:   TotInsp ← TotRight ← 0
 5:   OFs ← ∅             ▷ Will contain the outlying factors of all candidate trans.
 6:   for all i ∈ ITEMS do
 7:     if |{tr ∈ D_{i,m}}| < 10 then
 8:       TotInsp ← TotInsp + |{tr ∈ D_{i,m}}|
 9:       TotRight ← TotRight + |{tr ∈ D_{i,m} : label(tr) = error}|
10:     else
11:       OFs ← OFs ∪ OutlierRanking(D_{i,m})
12:     end if
13:   end for
14:   R ← PercS × TotTrans − TotInsp      ▷ The %S remaining...
15:   if R > 0 then
16:     TotInsp ← TotInsp + R
17:     OFs ← SortDecreasing(OFs)
18:     TotRight ← TotRight + |{tr ∈ {OFs}_{n=1}^{R} : label(tr) = error}|
19:   end if
20:   %R_m ← TotRight / TotErrors
21: end for

function used. For this parameter we have experimented with both the euclidean and
canberra functions. Finally, our method also requires the specification of a limit on the
size of a group in order to be selected as a group of (potential) outliers (the t threshold
in Equation 2). The possible combinations of these settings make up a total of 14
variants of our method.
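For concreteness, a hedged sketch of this grid of variants, assuming the t threshold was held fixed, and using the method names accepted by R's hclust() and dist() (the latter spells the Canberra distance "canberra"):

    variants <- expand.grid(
      agg  = c("ward", "single", "complete", "average", "mcquitty", "median", "centroid"),
      dist = c("euclidean", "canberra"),
      stringsAsFactors = FALSE)
    nrow(variants)    # 14 variants of the OF_H ranker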
With respect to LOF, we have used the implementation of this algorithm that is
available in the R package dprep [15]. We have also experimented with 14 variants of
this method, namely by varying the number of neighbours used by the method from 2 to
28 in steps of 2.
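A hedged sketch of these LOF variants (item.trans stands for the data of one item; we assume dprep exposes its LOF implementation as lofactor(data, k), which returns one score per row):

    library(dprep)
    ks <- seq(2, 28, by = 2)                                  # the 14 LOF variants (k = 2, 4, ..., 28)
    lof.scores <- sapply(ks, function(k) lofactor(item.trans, k))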
In our graphs of results we also plot the %S and %R value of the method described
in [2], which is denoted in the graphs as “LTS04”. This method is not an outlier ranking
algorithm. It simply outputs the (unordered) set of transactions it judges as being outliers,
which leads to a single pair of %S and %R values. In this case the user is not able to
adjust the %S value to the available resources. By chance, in none of the testing months
was the 50% limit of selected transactions surpassed, but with this type of method there
is no such guarantee. In months when the available resources are not sufficient to analyze
all the transactions selected, the experts must decide which ones to let aside. Additionally,
in the months when the number of transactions that could be analyzed by the available
resources is greater than the number of selected transactions, the experts must arbitrarily


Figure 4. The results of the experiments on the INTRASTAT data. [Figure: one panel per testing month (Jan, Feb, Mar, May, Jun, Aug, Sep and Oct 1998), plotting % Selected against % Recall for the LOF and OF_H variants at the 35%, 40%, 45% and 50% working points and for LTS04; dotted lines mark the 90% recall and 50% selection requirements.]

select further transactions to check. For the “LTS04” method the same 14 variants used
with OFH were tried.

3.4. Results

Figure 4 shows the results of our comparative experiments in terms of recall (%R) and
percentage of selected transactions (%S) for each of the 8 available testing months. For
each of the methods we have always reported the best result of the 14 variants that were
tried. These can thus be regarded as the best possible outcome of these methods. Each
graph in the figure represents a month. All graphs have two dotted lines indicating the
experts’ requirements (at least 90% recall and at most 50% selected transactions). This
means that for each graph the best place to be is the bottom right corner (maximum
%R and minimum %S). Still, the most important statistic is Recall as long as we do
not exceed the 50% limit. The four points for both OFH and LOF represent the
four previously selected working points in terms of %S. Still, we should recall that both
methods would be better represented by lines as any other working points could have
been selected. Some of the points are not shown on some graphs because the respective
method achieved a very poor score that is outside of the used axes limits.
The results of our experiments (cf. Figure 4) clearly indicate that our method is
competitive with a state of the art outlier ranking method, LOF. This confirms previous
results on a different set of applications [3]. Moreover, our method is always able to
fulfil the minimum requirement of 90% recall, which is not always the case with LOF.
Compared to “LTS04”, both OFH and LOF lose a few times in terms of achieving the
same %R for the same level of %S. Still, we should recall that “LTS04” provides no
flexibility in terms of available human resources and thus it can happen (as for instance
in Jun/1998) that the solution provided by this method does not attain the objectives of
the experts or even that it is not feasible because it requires too many resources.
As discussed in Section 3.2, the results presented in Figure 4 include all transactions
from infrequent items, i.e, items with less than 10 transactions. An analysis of Figure 4
taking into account the impact of infrequent items (cf. Table 3), raises an important ques-
tion. In effect, the decision of inspecting infrequent items was “imposed” by the INE
experts. However, by looking at our results we think this decision is rather question-
able. For instance, in Jan/1998 the inclusion of the small items incurred a “cost” of


%S = 35.7%, whilst only allowing us to detect 35.4% of the errors. By simply adding
10% more transactions, our method (OF.H.45) was able to boost the recall to 95%. Now
the question is: is it really necessary to analyze all the transactions in infrequent items?
The small amount of data makes the outlier detection method proposed here inappro-
priate for these items. However, it may be possible to use some other form of statistical
decision method to reduce the amount of transactions from infrequent items to analyze.
Our results clearly indicate that statistical-based outlier detection methods are able to do
a much better job than this brute force approach. Therefore, if we can reduce the amount
of effort required for infrequent items, then more resources can be dedicated to analyzing
transactions selected by the outlier detection method proposed here.
A lesson that can be learned from this observation is that not all domain-specific
knowledge is useful. However, addressing the problem of using automatic methods to
select transactions in infrequent items is not just a technical challenge, caused by the
small volume of data. If we are able to successfully detect outliers in these items, the next
challenge will be to convince the experts to change their beliefs.

4. Related Work

Outlier detection is a well studied topic (e.g. [16]). Different approaches have been taken
to address this task. Distribution-based approaches (e.g. [17,18]) assume a certain para-
metric distribution of the data and signal outliers as observations that deviate from this
distribution. The main drawbacks of these approaches lie on the constraints of the as-
sumed distributions. Depth-based methods (e.g. [19]) are based on computational ge-
ometry and compute different layers of k-d convex hulls and then represent each data
point in this space together with an assigned depth. In practice these methods are too
inefficient for dealing with large data sets. Knorr and Ng [7] introduced distance-based
outlier detection methods. These approaches generalize several notions of distribution-
based methods but still suffer from several problems, namely when the density of the data
points varies (e.g. [14]). Density-based local outliers [20,14] are able to find this type of
outliers and are the appropriate setup whenever we have a data set with a complex distri-
bution structure. These authors defined the notion of Local Outlier Factor (LOF) for each
observation, which naturally leads to the notion of outlier ranking. The key idea of this
work is that the notion of outlier should be “local” in the sense that the outlier degree of
any observation should be determined by the clustering structure in a bounded neighbor-
hood of the observation. In Section 3 we have seen that our method compares favorably
with the LOF algorithm on the problem of detecting errors in Portuguese foreign trade
transactions.
Other authors have looked at the problem of outliers from a supervised learning per-
spective (e.g. [1,21]). Usually, the goal of these approaches is to classify a given obser-
vation as being an outlier or as a “normal” case. These approaches are typically affected
by the problem of unbalanced classes that occurs in outlier detection applications, be-
cause outliers are, by definition, much less frequent than the “normal" observations. If
adequate adjustments are not made, this kind of class distribution usually deteriorates the
performance of the supervised models [22].


5. Conclusions

In this paper we have presented a method for obtaining a ranking of outlyingness using a
hierarchical clustering approach. This method uses the height at which cases are merged
in the clustering process as the key factor for obtaining a degree of outlyingness.
We have applied our methodology to the task of detecting erroneous foreign trade
transactions in data collected by the Portuguese Institute of Statistics (INE). The results
of the application of our method to this problem clearly met the performance criteria
outlined by the human experts. Moreover, our results outperform previous approaches to
this same problem. Compared to these previous approaches, our method provides a result
that allows a flexible management of the available human resources for the manual task
of inspecting the potential erroneous transactions.
Our results have also revealed a potential inefficiency on the process used by INE to
handle the items with a small number of transactions. In future work we plan to address
these items in a way that we expect to further improve our current results.

Acknowledgements

This work was partially funded by FCT projects oRANKI (PTDC/EIA/68322/2006) and
Rank! (PTDC/EIA/81178/2006) and by a sabbatical grant from the Portuguese govern-
ment to L. Torgo. We would like to thank INE for providing the data used in this study.

References

[1] C. Soares, P. Brazdil, J. Costa, V. Cortez, and A. Carvalho. Error detection in foreign trade data using
statistical and machine learning methods. In N. Mackin, editor, Proc. of the 3rd International Conference
on the Practical Applications of Knowledge Discovery and Data Mining, pages 183–188, 1999.
[2] A. Loureiro, L. Torgo, and C. Soares. Outlier detection using clustering methods: a data cleaning ap-
plication. In Malerba D. and May M., editors, Proceedings of KDNet Symposium on Knowledge-based
Systems for the Public Sector, 2004.
[3] L. Torgo. Resource-bounded fraud detection. In Neves et al., editors, Proceedings of the 13th Portuguese
Conference on Artificial Intelligence (EPIA’07), LNAI, pages 449–460. Springer, 2007.
[4] Shichao Zhang, Chengqi Zhang, and Qiang Yang. Data preparation for data mining. Applied
Artificial Intelligence, 17(5–6):375–381, May 2003.
[5] J.S. Milton, P.M. McTeer, and J.J. Corbet. Introduction to Statistics. McGraw-Hill, 1997.
[6] W.D. Fisher. On grouping for maximum homogeneity. Journal of the American Statistical Association,
53:789–798, 1958.
[7] Edwin M. Knorr and Raymond T. Ng. Algorithms for mining distance-based outliers in large datasets. In
Proceedings of the 24th International Conference on Very Large Data Bases (VLDB 1998), pages 392–403.
Morgan Kaufmann, 1998.
[8] R. Quinlan. C5.0: An Informal Tutorial. RuleQuest, 1998.
http://www.rulequest.com/see5-unix.html.
[9] R. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proc. of VLDB’94,
1994.
[10] L. Kaufman and P.J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley,
New York, 1990.
[11] F. Murtagh. Complexities of hierarchic clustering algorithms: state of the art. Computational Statistics
Quarterly, 1:101–113, 1984.
[12] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, 2008. ISBN 3-900051-07-0.


[13] F. Murtagh. Multidimensional clustering algorithms. COMPSTAT Lectures 4, Wuerzburg: Physica-Verlag, 1985.
[14] M. M. Breunig, H. P. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. In
Proceedings of ACM SIGMOD 2000 International Conference on Management of Data, 2000.
[15] Edgar Acuna and members of the CASTLE group. dprep: Data preprocessing and visualization func-
tions for classification, 2008. R package version 2.0.
[16] Victoria Hodge and Jim Austin. A survey of outlier detection methodologies. Artificial Intelligence Review,
22:85–126, 2004.
[17] D. M. Hawkins. Identification of Outliers. Chapman and Hall, 11 New Fetter Lane, London EC4P 4EE,
1980.
[18] V. Barnett and T. Lewis. Outliers in statistical data. John Wiley, 1994.
[19] F. Preparata and M. Shamos. Computational Geometry: an introduction. Springer-Verlag, 1988.
[20] M. M. Breunig, H. P. Kriegel, R. Ng, and J. Sander. OPTICS-OF: Identifying local outliers. Lecture Notes
in Computer Science, 1704:262–270, 1999.
[21] L. Torgo and R. Ribeiro. Predicting outliers. In N. Lavrac, D. Gamberger, L. Todorovski, and H. Bloc-
keel, editors, Proceedings of Principles of Data Mining and Knowledge Discovery (PKDD’03), number
2838 in LNAI, pages 447–458. Springer, 2003.
[22] G. Weiss and F. Provost. The effect of class distribution on classifier learning: an empirical study.
Technical Report ML-TR-44, Department of computer science, Rutgers University, 2001.
Data Mining for Business Applications 99
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-99

An Integrated System to Support Electricity Tariff Contract Definition
Fátima RODRIGUES 1, Vera FIGUEIREDO and Zita VALE
GECAD - Knowledge Engineering and Decision Support Group

Abstract. This paper presents an integrated system that helps both retail companies
and electricity consumers in the definition of the best retail contracts and tariffs.
This integrated system is composed of a Decision Support System (DSS) based
on a Consumer Characterization Framework (CCF). The CCF is based on data
mining techniques, applied to obtain useful knowledge about electricity consumers
from large amounts of consumption data. This knowledge is acquired following an
innovative and systematic approach able to identify different consumers’ classes,
represented by a load profile, and its characterization using decision trees. The
framework generates inputs to use in the knowledge base and in the database of the
DSS. The rule sets derived from the decision trees are integrated in the knowledge
base of the DSS. The load profiles together with the information about contracts
and electricity prices form the database of the DSS. This DSS is able to perform
the classification of different consumers, present its load profile and test different
electricity tariffs and contracts. The final outputs of the DSS are a comparative
economic analysis between different contracts and advice about the most economic
contract to each consumer class. The presentation of the DSS is completed with
an application example using a real database of consumers from the Portuguese
distribution company.

Keywords. electricity markets, load profiles, hierarchical clustering, classification



Introduction

The full liberalization of most of the electricity markets in Europe and around the world
creates a new environment where several retail companies compete for the electricity sup-
ply of end users. According to [1] the development of decision support tools is of major
importance to show consumers the potential savings they can get by assuming a more
active participation in the electricity markets. Also a company with the opportunity to
trade energy to consumers, in a competitive scenario, must make many decisions and
evaluations to attain the best tariff structure and to study the portfolio of contracts to
offer to consumers. Making these decisions and correctly evaluating the options is a
time-consuming process. Another important point is the progressive replacement of tra-
ditional electricity meters by real-time meters: as this happens, the amount of data collected
will grow exponentially. The development of frameworks and tools able to extract use-
ful knowledge from these huge volumes of data and use it in decision support will be a
1 Contact author: School of Engineering, Polytechnic Institute of Porto, Rua Dr António Bernardino de

Almeida, 4200-072 Porto, Portugal, E-mail: [email protected]



competitive advantage for the electricity retailers and an important step towards a more ac-
tive demand-side participation. The freedom to define a larger portfolio of contracts by
retail companies, and the freedom of consumers to choose between different contracts
and different companies, increases the need of decision support tools to help both sides.
This scenario motivates the development of a decision tool for the selection of the best
electricity retail contract, which is presented in this paper. It is possible to find in the lit-
erature some previous works dedicated to this problem. In [2] data mining techniques are
applied to the problem of load profiling. In [3] a load research project followed by load
profiling is presented and the results of this work are used to support tariff definition. In
[4] consumers classes and its load profiles are defined by clustering techniques and the
results are used to study different contracts for producers. In [5], a framework for the
automatic classification and characterization of electricity consumers is presented, which
is able to deal with large amounts of data and perform the classification of different con-
sumers according to their load profiles. This paper is organized as follows: in section 1, a
description of consumer characterization framework is made, section 2 presents the data
mining module, in section 3, the DSS is described, and in section 4, a practical example
is presented. Finally, in section 5 we present the conclusions.

1. Consumer Characterization Framework

The knowledge about how and when consumers use electricity is essential to develop an
efficient DSS. This knowledge must be obtained from historical data and must be up-
dated to follow the changes on consumer’s behavior. To generate this kind of knowledge
and keep it regularly updated, a comprehensive methodology was developed. Due to the
large amount of data predicted to be available in the future and the need for easy updat-
ing, the CCF provides a clear separation of different steps that include various Data Min-
ing techniques. The proposed framework is based on the study of previous load profiling
projects [6,7] and on the structure of the KDD process [8].
In the cleaning phase we check for inconsistencies in the data and outliers are removed.
Anomalous consumption values and outages are detected and replaced based on the in-
formation of similar days. This type of error represents 1% of the total data. In the prepro-
cessing phase missing values are detected and replaced using regression techniques. Lin-
ear regression is used to estimate numerical attributes like missing values of measures,
and logistic regression is used to estimate nominal attributes like the missing commercial
information (about 6% of data), such as activity type, tariff type. The regression models
permit the substitution of values with 95% confidence. With this procedure the
major problems encountered are minimized and the initial data set is clean and complete.
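As an illustration of this step, a hedged R sketch (hypothetical data frame and column names, not the actual campaign schema): lm() for a missing numeric measurement, and glm(..., family = binomial) for a missing two-level nominal attribute (a multi-level attribute such as activity type would require a multinomial model instead).

    obs <- !is.na(camp$power)
    fit.num <- lm(power ~ hour + day.type + season, data = camp[obs, ])
    camp$power[!obs] <- predict(fit.num, newdata = camp[!obs, ])

    has.tariff <- !is.na(camp$tariff.type)        # tariff.type assumed binary here
    fit.nom <- glm(tariff.type ~ contracted.power + activity,
                   family = binomial, data = camp[has.tariff, ])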
Next, we divide the data into subsets. This is done using prior knowledge about how load-
ing conditions, like the season of the year and the type of day (working day or
weekend), affect electricity consumption. Data is separated, according to the different
loading conditions, in smaller data sets. We have two data sets representing each season
of the year, one for working days and another for weekends. To obtain a more effective
data reduction, without losing important information, the data from each individual con-
sumer is reduced. This is based on the reduction of the measured daily load diagrams,
corresponding to each loading condition, to one representative load diagram. These rep-
resentative load diagrams are obtained elaborating the data from the measurement cam-


Figure 1. Daily load diagram - over a day

paign. For each consumer, the representative load diagram is built by averaging the mea-
sured load diagrams using a spreadsheet. Each consumer is then described by one single
representative load diagram in each data set, for the different loading conditions (see fig-
ure 1). The diagrams are computed using the field-measurements values, so they need to
be brought together to a similar scale for the purpose of their pattern comparison. This is
achieved through normalization. For each consumer the vector of the representative load
diagram was normalized to the [0-1] range using the peak power of the representative
load diagram. This kind of normalization allows maintaining the shape of the curve and
permits comparing the consumption patterns.
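A minimal R sketch of this reduction and normalization step (daily is a hypothetical days × 96 matrix with the 15-minute powers measured for one consumer under one loading condition):

    rep.diagram <- function(daily) {
      r <- colMeans(daily)        # representative load diagram (one value per 15-minute interval)
      r / max(r)                  # peak-power normalization keeps the curve shape in [0, 1]
    }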
The application of data mining techniques is done using one isolated technique or by com-
bining several techniques, to build models able to find relevant knowledge about the dif-
ferent consumption patterns found in data. The implementation of the models involves
several steps, like attribute selection, fitting the models to the data and evaluating the
models. This will be described in the next section.
2. Data Mining Module

The data mining module for consumer characterization is based on the combination of
unsupervised and supervised learning techniques. After the data pre-processing and re-
duction phase, each consumer is described by its representative load diagram and the
commercial indexes used by the distribution company. The representative daily load di-
agram of the m-th consumer is the vector l^(m) = {l_1^(m), ..., l_H^(m)}, where l_h^(m) is the
normalized value of the instantaneous power consumed at instant h, for h = 1, ..., H, with
H = 96, corresponding to the 15-minute intervals between the collected measurements.
The commercial indexes available are of contractual nature (i.e., activity type, con-
tracted power, tariff type, supply voltage level). The distribution company, to classify its
clients, defines these indexes a priori.
The proposed module is divided in two main sub-modules according to the task they ad-
dress: segmentation and targeting. In the first sub-module unsupervised learning, based
on clustering techniques, is used to obtain a partition of the initial sample into a set of


consumer clusters. These clusters represent the different consumption patterns existing
in the available sample. Each of these clusters is represented by its load profile. In the
second sub-module, supervised learning (using decision trees) is used to describe each
cluster by a rule set and create a classification model able to assign consumers to the
existing clusters. The first sub-model is important to the determination and actualization
of the load profiles, and the classification model is also important as new data is collected
to the assignment of new consumers to the existing consumer classes.

2.1. Load Profiling Sub-Module

The load profiling sub-module's goal is to partition the initial data sample into a set of clusters defined according to the load shape of the representative load diagrams of each consumer. This is done by assigning consumers with the most similar behavior to the same cluster, and consumers with dissimilar behavior to different clusters. The first step of the module development was the selection of the most suitable attributes to be used by the clustering model. To obtain the best separation between the classes it is important to use the most detailed information about the shape of the consumers' load diagrams, so the vectors with the normalized representative load diagrams are the best option. The number of clusters is an input of the model, so it must be defined based on a criterion that leads to an adequate selection. The range of admissible numbers of clusters was defined by the electricity company, which determined a minimum of 6 and a maximum of 9 classes. To define the number of classes, several clustering operations were performed to study the evolution of cluster compactness using the Mean Index Adequacy (MIA) measure presented in [4]. The following distances (1) and (2) are defined to assist the formulation of the adequacy measure:
1. Distance between two load diagrams:

$$d(l_i, l_j) = \sqrt{\frac{1}{H}\sum_{h=1}^{H}\bigl(l_i(h) - l_j(h)\bigr)^2} \qquad (1)$$

2. Distance between a representative load diagram and the center of a set of diagrams, defined as the geometric mean distance between $r^{(k)}$ and each element $l^{(m)}$ of $L^{(k)}$:

$$d(r^{(k)}, L^{(k)}) = \sqrt{\frac{1}{n^{(k)}}\sum_{m=1}^{n^{(k)}} d(r^{(k)}, l^{(m)})^2} \qquad (2)$$

Let us consider a set of $M$ load diagrams separated into $K$ clusters, $k = 1, \ldots, K$, where each cluster $k$ is formed by a subset $C^{(k)}$ of load diagrams and $r^{(k)}$ is the representative load diagram assigned to cluster $k$. The MIA is defined by:

$$MIA = \sqrt{\frac{1}{K}\sum_{k=1}^{K} d(r^{(k)}, C^{(k)})^2} \qquad (3)$$
Smaller values of MIA indicate more compact clusters. The k-means algorithm was used to study the cluster tendency of the data set based on the MIA measure. The obtained results are presented in Figure 2.
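A minimal sketch of this compactness study, assuming scikit-learn's KMeans as a stand-in for the clustering implementation actually used and a placeholder data matrix:

```python
import numpy as np
from sklearn.cluster import KMeans

def mia(data, labels, centers):
    """Mean Index Adequacy (Eqs. 1-3): RMS over clusters of the RMS distance
    between each cluster centre and the diagrams assigned to it."""
    per_cluster = []
    for k in range(centers.shape[0]):
        members = data[labels == k]
        if len(members) == 0:
            continue
        d = np.sqrt(((members - centers[k]) ** 2).mean(axis=1))   # Eq. 1 per member
        per_cluster.append(np.sqrt((d ** 2).mean()))              # Eq. 2
    return np.sqrt(np.mean(np.array(per_cluster) ** 2))           # Eq. 3

data = np.random.rand(500, 96)   # placeholder for the normalised diagrams
for k in range(6, 13):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    print(k, round(mia(data, km.labels_, km.cluster_centers_), 4))
```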


Figure 2. MIA evolution with the number of clusters

It is possible to see that 9 clusters is the best choice, considering the indication of the distribution company and the evolution of the MIA: for more than 9 clusters the improvement in cluster compactness, represented by the decrease of the MIA values, is not very relevant.
The selection of the most suitable clustering algorithm is described in [6] and was
based on a comparative analysis of the performance of different algorithms. Several al-
gorithms were tested performing different clustering operations. The best results are ob-
tained with a combination of a self-organizing map (SOM) [9] with the classical k-means
algorithm [10]. This combination operates in two levels. In the first level the SOM is
used to obtain a reduction of the dimension of the initial data set. The SOM performs the
projection of the H-dimensional space, containing the M vectors representing the load
diagrams of the consumers in the initial data set, into a bi-dimensional space. Two co-
ordinates, representing the SOM attributes in the bi-dimensional space, are assigned to
each client. At the end of the first level the initial data set is reduced to the number of
winning units in the output layer of the SOM, represented by its weight vectors. This set
of vectors is able to keep the characteristics of the initial data set and achieve a reduction
of its dimension. In the second level the k-means algorithm is used to group the weight
vectors of the SOM’s units and the final clusters are obtained. The use of the k-means in
the second level allows the definition of the number of clusters as an input of the model.
This combination is particularly interesting for the large data sets that are common in data mining problems. The SOM handles large amounts of data well, reducing them to a smaller data set. During the comparative analysis it was possible to conclude that the k-means algorithm performs very well on data sets with continuous attributes, like the ones we are using, but has limitations with large data sets. The combination of both algorithms overcomes these limitations and yields a solution able to deal with large data sets. Testing both solutions, we concluded that the results obtained were similar, which supports the effectiveness of the proposed combination. The load profiles for each class are obtained by averaging the representative load diagrams of the consumers assigned to the same cluster.
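The two-level combination can be sketched as follows; the grid size, number of iterations, the third-party minisom package and scikit-learn's KMeans are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np
from minisom import MiniSom          # third-party SOM package (assumed available)
from sklearn.cluster import KMeans

data = np.random.rand(5000, 96)      # placeholder for the M normalised diagrams

# Level 1: project the 96-dimensional diagrams onto a 2-D SOM grid
som = MiniSom(15, 15, 96, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(data, num_iteration=10000)

# the weight vectors of the winning units form the reduced data set
winners = sorted({som.winner(x) for x in data})
codebook = np.array([som.get_weights()[i, j] for (i, j) in winners])

# Level 2: k-means on the SOM weight vectors yields the final 9 clusters
km = KMeans(n_clusters=9, n_init=10, random_state=0).fit(codebook)

# each consumer inherits the cluster of its winning SOM unit
unit_label = dict(zip(winners, km.labels_))
consumer_cluster = np.array([unit_label[som.winner(x)] for x in data])
print(np.bincount(consumer_cluster))
```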


Table 1. Normalized load shape indexes

Parameter       Definition                              Period of definition
Load Factor     d1 = Pav,day / Pmax,day                 1 day
Night Impact    d3 = (1/3) · Pav,night / Pav,day        1 day (8 night hours, from 11 p.m. to 7 a.m.)
Lunch Impact    d5 = (1/8) · Pav,lunch / Pav,day        1 day (3 lunch hours, from 12:00 to 15:00)
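A sketch of how the Table 1 indexes can be computed from a 96-point representative diagram; the mapping of the stated night and lunch hours onto 15-minute slots is an assumption made here for illustration:

```python
import numpy as np

def load_shape_indexes(diagram):
    """Table 1 indexes from a 96-point (15-minute) daily diagram.
    Night = 23:00-07:00 (8 h), lunch = 12:00-15:00 (3 h)."""
    diagram = np.asarray(diagram, dtype=float)
    p_av_day = diagram.mean()
    night = np.r_[diagram[92:], diagram[:28]]       # 23:00-24:00 plus 00:00-07:00
    lunch = diagram[48:60]                          # 12:00-15:00
    d1 = p_av_day / diagram.max()                   # load factor
    d3 = (1.0 / 3.0) * night.mean() / p_av_day      # night impact
    d5 = (1.0 / 8.0) * lunch.mean() / p_av_day      # lunch impact
    return d1, d3, d5

print(load_shape_indexes(np.random.rand(96)))       # placeholder diagram
```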

2.2. Classification Sub-Module

The major goals of the classification module are the following:

• inference of a rule set to characterize each class;
• support for the assignment of new consumers to the classes obtained by the load profiling module.
The first attempt, detailed in [7], was to search for a correlation between the commercial indexes and the classes obtained. The results show that only a poor correlation exists, so it is not possible to create a good classification model based solely on the commercial indexes. This means that new indexes, able to capture relevant information about consumption behavior, must be derived to obtain a more complete and useful consumer characterization and to create the classification model. These indexes must contain information about the daily load curve shape of each consumer. Several such indexes were proposed in [11]; we selected the most relevant ones, presented in Table 1.
The classification module uses supervised learning, based on the knowledge about the relation between the characteristics of a consumer and its corresponding class, obtained with the clustering operation. The model's goal attribute is the consumer class obtained by the clustering module. The load shape indexes are computed for each group of consumers using the representative load diagrams. In order to reduce the range of values assumed by these indexes, and to treat them as nominal attributes, they are replaced by a small number of distinct categories using an interval equalization method [12]. This method builds intervals of different sizes, chosen so that approximately the same number of consumers falls into each one, minimizing the loss of information due to the replacement of the indexes by a set of discrete categories. Each interval is a class label. This also allows the load shape indexes to be treated in the same manner as the commercial indexes. The classification model inputs are the commercial and the load shape indexes, for each class load profile. The classification algorithm used is C5.0 [13]. This algorithm was selected because it provides interpretable models, is adequate for nominal attributes and does not require long training times, so it performs well with large data sets such as the ones used in data mining. The model evaluation is performed using ten-fold cross-validation. This increases the computational effort but improves the reliability of the accuracy estimate, where accuracy is the proportion of correct results (both true positives and true negatives) in the population. The classification model creates a complete characterization of the consumers' classes based on the most relevant attributes selected by the model. This model will be the knowledge base of the DSS.

Figure 3. DSS Architecture
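A hedged sketch of this classification step under stated assumptions: pandas' qcut stands in for the interval equalization method of [12], a scikit-learn decision tree stands in for C5.0, and all column names, category labels and data are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# one row per consumer: commercial indexes, load shape indexes, cluster label
df = pd.DataFrame({
    "activity": np.random.choice(["residential", "commerce", "industry"], 1000),
    "tariff":   np.random.choice(["FR", "TOU"], 1000),
    "d1":       np.random.rand(1000),
    "d3":       np.random.rand(1000),
    "cluster":  np.random.randint(0, 9, 1000),
})

# interval equalization: equal-frequency bins turn each numeric index into a
# nominal attribute with roughly the same number of consumers per category
for col in ("d1", "d3"):
    df[col] = pd.qcut(df[col], q=5,
                      labels=["low", "medium", "high", "very high", "ultra high"])

X = df[["activity", "tariff", "d1", "d3"]].astype(str)
y = df["cluster"]

model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                      DecisionTreeClassifier(random_state=0))
print(cross_val_score(model, X, y, cv=10).mean())   # ten-fold cross-validation
```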

3. Decision Support System

The implementation of the DSS started with the definition of a clear separation between data storage, knowledge organization and the program that performs the calculations to obtain the best contract. This separation is very important to obtain a flexible DSS able to deal with large amounts of data [14]. The DSS must easily permit the introduction of the contractual consumer characteristics and of the most relevant attributes describing the consumer's behavior. These input parameters start a decision process that ends with the presentation of the most adequate contract (the best contract). This decision is complemented by a comparative analysis of all the available options, to make clearer both the choice of the best contract and the potential gain obtained with the change. The internal structure of the DSS (Figure 3) comprises:
• a knowledge base where the rule sets obtained by the CCF are stored;
• a data base gathering all the information about the load profiles and the electricity tariff structures available, or being tested, in the company;


• a working memory that contains all the information about the contract that is
either supplied by the user or inferred by the system during the session;
• an inference engine that matches the facts contained in the working memory with
the knowledge contained in the knowledge base, to draw conclusions;
• a user interface to easily collect consumer characteristics as inputs, start the deci-
sion process and present the results in a clear way.

3.1. Knowledge Base

The rules that define and characterize the clients' load profiles are stored in the knowledge base. The rules created in the CCF are described by the load shape indexes, namely the load factor index (d1) and the night impact index (d3), and by the contracted power. These parameters are the input data the DSS needs to calculate the best contract. For new clients, this information is initially obtained by asking them how many and what kind of electric devices they have, how many people live in the household, and so on. To make data manipulation by the DSS easier and more practical, we have simplified the input parameters into a set of intervals: instead of entering a specific value for each parameter, the user provides a categorical value such as low, medium, high, very high or ultra high. This simplification was necessary to allow the practical application of the DSS to small consumers without real-time meters. Until the electricity meters are replaced by real-time meters, a low-voltage (LV) consumer does not know the exact value of its load factor or night impact. On the other hand, based on simple information about consumption habits, it is possible to predict in which of the categories presented below these indexes fall. This simplification loses some precision but permits the application of the DSS to LV consumers. The load factor and the night impact were classified into the following discretized intervals:

d1 ∈ [0.2; 0.3] → low          d3 ∈ [0; 0.2] → low
d1 ∈ [0.3; 0.4] → medium       d3 ∈ [0.2; 0.3] → medium
d1 ∈ [0.4; 0.5] → high         d3 ∈ [0.3; 0.4] → high
d1 ≥ 0.6 → ultra high          d3 ≥ 0.6 → ultra high

The resulting knowledge base is much simpler, easy to manipulate and makes it possible to classify low-voltage consumers. As an example, we present next the rule set obtained for the winter working-day classes using a data base of Portuguese consumers (a minimal sketch of how such rules can be matched is given after the list). Each of these classes has a different load profile that is used as the reference for the cost calculations and for assessing potential savings.

If d1=High and d3=Very High then class 1
If d1=High and d3=Ultra High then class 1
If d1=Very High and d3=Very High then class 1
If d1=Very High and d3=Ultra High then class 1
If d1=Very High and d3=Medium then class 3
If d1=Ultra High then class 3
If d1=Low then class 4
If d1=Medium then class 4
If d1=High and d3=Low then class 7
If d1=Very High and d3=Medium then class 7
If d1=Very High and d3=Low then class 7
If d1=High and d3=Medium then class 9
If d1=High and d3=High then class 9
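The stored rule sets can be matched against the DSS inputs with a very small rule engine; the dictionary encoding and the first-match policy below are illustrative choices, not the authors' implementation:

```python
# Each rule is a (conditions, class) pair; attributes absent from the conditions
# act as wildcards.  Rules are taken from the list above; the published list also
# assigns (Very High, Medium) to class 7, and a first-match policy resolves that
# overlap in favour of class 3.
WINTER_WORKING_DAY_RULES = [
    ({"d1": "High",       "d3": "Very High"},  1),
    ({"d1": "High",       "d3": "Ultra High"}, 1),
    ({"d1": "Very High",  "d3": "Very High"},  1),
    ({"d1": "Very High",  "d3": "Ultra High"}, 1),
    ({"d1": "Very High",  "d3": "Medium"},     3),
    ({"d1": "Ultra High"},                     3),
    ({"d1": "Low"},                            4),
    ({"d1": "Medium"},                         4),
    ({"d1": "High",       "d3": "Low"},        7),
    ({"d1": "Very High",  "d3": "Low"},        7),
    ({"d1": "High",       "d3": "Medium"},     9),
    ({"d1": "High",       "d3": "High"},       9),
]

def assign_class(facts, rules=WINTER_WORKING_DAY_RULES):
    """Return the class of the first rule whose conditions all match the facts."""
    for conditions, cls in rules:
        if all(facts.get(attr) == value for attr, value in conditions.items()):
            return cls
    return None

print(assign_class({"d1": "High", "d3": "Low"}))   # -> 7
```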

The knowledge base is composed of different rule sets corresponding to the different loading conditions: winter working days and weekends, and summer working days and weekends. Different calculations are performed for winter and for summer.

3.2. Data Base

The information gathered in this database consists of the load profiles obtained by the CCF and the tariff structures corresponding to the different contracts to be used in the economic study. The database comprises different types of contracts based on the Portuguese regulated tariffs for 2004: Fixed Rate (FR) contracts; Time-of-Use (TOU) contracts, which have different prices for peak and off-peak hours and may follow either a weekly cycle, with different schedules for working days and weekends (TOU-WC), or a daily cycle, with the same schedule for all week days (TOU-WDC); Tailored Contracts (TC); and Real-Time Pricing (RTP). The DSS can easily be adjusted to run test simulations for different tailored contracts with different profit levels. Besides the retailer profit, these contracts can also include an insurance factor reflecting the level of risk shared by both parties involved in the contract.

3.3. Interface

An interactive and easy-to-use interface was developed to allow the user to interact with the DSS. In this interface, the inputs necessary to characterize each client are the load factor, the night impact and the contracted power. After this information is introduced, the DSS searches for the load profile most adequate to the client. Next, the DSS performs the calculations for the available contracts of this load profile, and presents the electricity costs for the different contractual structures. After that it is possible to choose the most economical contract. Figure 4 presents the interface with a simulation case as an example.

4. Simulation Results

The DSS was tested and validated by running a large number of simulations for different possible clients, and the results obtained were very satisfactory. The DSS is flexible enough to study different types of contracts and to adjust them in a very simple way. The system is able to deal with large amounts of data and can be extended to work with real-time updating of the database. As an example, we present the results obtained by the DSS for a simulation using the following consumer characteristics:

Inputs: d1 = High; d3 = Low; CP = 9.9 kVA


Figure 4. DSS User Interface

Outputs: Working days: Class 7; Weekends: Class 5;
Best Contract: Tailored Contract (TC).
A comparative economic analysis of the different contracts is also presented (see the available contracts for this load profile in Figure 4). For consumers with a high load factor (d1) and a low night impact (d3), the most economical contract is the proposed Tailored Contract (TC), followed by Real-Time Pricing (RTP). The other existing contracts are more expensive: TOU-WC and TOU-WDC present the same value for this type of client, and FR is the most expensive. From this example we can conclude that the current electricity prices can be improved. This simulation was repeated for a large number of different situations with different input factors, and RTP and TC were always the best options.

5. Conclusions

A robust and flexible DSS for the study, testing and selection of the most adequate electricity contract was presented. This DSS uses as inputs the most relevant load shape indexes and the contracted power of a consumer, and provides as outputs the consumer's load profile, a comparative economic analysis of the different contracts and, finally, the decision about the most adequate contract. The knowledge base of the DSS is the result of a CCF based on data mining techniques developed to extract useful knowledge from large amounts of consumer data. Its database is composed of the load profiles of the different classes and the contracts being tested. The DSS was tested and validated with real data from the Portuguese distribution company, and the results obtained after running a large number of simulations were very satisfactory. This DSS is useful both for retail companies and for electricity consumers, because it helps to define the contract that best fits the client's load profile.

Acknowledgements

This work was developed within the DaMICE Project (Ref. POCTI/ESE/39744/2001), supported by the Portuguese Science and Technology Foundation.

References

[1] D. Kirchen, Demand-Side View of Electricity Markets, IEEE Transactions on Power Systems, vol. 18, no. 2, pp. 520-526, May 2003.
[2] B. Pitt and D. Kirchen, Application of Data Mining Techniques to Load Profiling, IEEE Transactions on Power Systems, May 1999.
[3] C. Chen, J.C. Hwang and C.W. Huang, Application of Load Survey to Proper Tariff Design, IEEE Transactions on Power Systems, vol. 12, no. 4, pp. 1746-1751, November 1997.
[4] G. Chicco, R. Napoli, P. Postulache, M. Scutariu and C. Toader, Customer Characterization Options for Improving the Tariff Offer, IEEE Transactions on Power Systems, vol. 18, no. 1, pp. 381-387, February 2003.
[5] V. Figueiredo, F. Rodrigues, Z. Vale and B. Gouveia, An Electric Energy Characterization Framework based on Data Mining Techniques, IEEE Transactions on Power Systems, vol. 20, no. 2, pp. 596-602, May 2005.
[6] F. Rodrigues, V. Figueiredo, J. Duarte and Z. Vale, A Comparative Analysis of Clustering Algorithms Applied to Load Profiling, Lecture Notes in Artificial Intelligence (LNAI 2734), pp. 73-85, Springer-Verlag, 2003.
[7] V. Figueiredo, J. Duarte, F. Rodrigues, Z. Vale and B. Gouveia, Electric Energy Customer Characterization by Clustering, Proceedings of ISAP 2003, Lemnos, Greece.
[8] U. Fayyad, G. Piatetsky-Shapiro, P.J. Smith and R. Uthurasamy, From Data Mining to Knowledge Discovery: An Overview, in Advances in Knowledge Discovery and Data Mining, pp. 1-34, AAAI/MIT Press, 1996.
[9] T. Kohonen, Self-Organisation and Associative Memory, 3rd Ed., Springer-Verlag, 1989.
[10] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.
[11] M. Ernoult and F. Meslier, Analysis and Forecast of Electrical Energy Demand, Revue Générale d'Electricité, vol. 4, pp. 381-387, 1982.
[12] I. Witten and E. Frank, Data Mining - Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann Publishers, Academic Press, 2002.
[13] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[14] Efraim Turban and Jay Aranson, Decision Support Systems and Intelligent Systems, Prentice Hall, 1998.


Mining Medical Administrative Data – The PKB Suite 1
Aaron CEGLAR a , Richard MORRALL b , John F. RODDICK a,2
a Flinders University, Bedford Park, South Australia 5042
b PowerHealth Solutions, Adelaide, South Australia 5000

Abstract.
Hospitals are adept at capturing large volumes of highly multi-dimensional data
about their activities including clinical, demographic, administrative, financial and,
increasingly, outcome data (such as adverse events). Managing and understanding
this data is difficult as hospitals typically do not have the staff and/or the expertise
to assemble, query, analyse and report on the potential knowledge contained within
such data. The Power Knowledge Builder (PKB) project investigated the adaption
of data mining algorithms to the domain of patient costing, with the aim of helping
practitioners better understand their data and therefore facilitate best practice.

Keywords. PKB, Medical Knowledge Discovery

Introduction

Hospitals are driven by the twin constraints of maximising patient care while minimising
the costs of doing so. For public hospitals in particular, the overall budget is generally
fixed and thus the quantity (and quality) of the health care provided is dependent on the
patient mix and the costs of provision.


Some of the issues that hospitals have to handle are frequently related to resource
allocation. This requires decisions about how best to allocate those resources, and an
understanding of the impacts of those decisions. Often the impact can be seen clearly
(for example, increasing elective surgery puts more surgical patients in beds, as a result
constraining admissions from the emergency department leading to an increase in wait-
ing times in the emergency department) but the direct cause may not be apparent (is the
cause simply more elective patients? is it discharge practices? has the average length of
stay changed? is there a change in the casemix in emergency or elective patients? and so
on). Analysing the data with all these potential variables is difficult and time consuming.
Focused analysis may come up with a result that explains the change but an unfocussed
analysis can be a fruitless and frustrating exercise.

1 We are indebted to the staff at PowerSolutions Pty Ltd with whom this suite was developed as part of the
PSD system (http://www.powerhealthsolutions.com/). The research was funded, in part, by an
AusIndustry START grant.
2 Corresponding Author. School of Computer Science, Engineering and Mathematics, Flinders University,
Adelaide, South Australia 5001. Email: [email protected]

As a part of this resource pressure, hospitals are often unable to have teams of analysts
looking across all their data, searching for useful information such as trends and anoma-
lies. For example, typically the team charged with managing the patient costing system,
which incorporates a large data repository, is small. These staff may not have a strong
statistical/epidemiological background or the time or tools to undertake complex multi-
dimensional analysis or data mining. Much of their work is in presenting and analysing
a set of standard reports, often related to the financial signals that the hospital responds
to (such as cost, revenue, length of stay or casemix). Even with OLAP tools and report
suites it is difficult for the users to look at more than a small percentage of the available
dimensions (usually related to the known areas of interest) and to undertake some ad hoc
analysis in specific areas, often as a result of a targeted request, e.g. what are the cost
drivers for liver transplants?
Even disregarding the trauma of an adverse patient outcome, adverse events can be
expensive in that they increase the clinical intervention required, resulting in higher-than-
average treatment costs and length-of-stay, and can also result in expensive litigation.
Unfortunately, adverse outcomes are not rare. A study by Wolff et al. [1] focusing on rural
hospitals estimated that 0.77% of patients experienced an adverse event while another by
Ehsani et al., which included metropolitan hospitals, estimated a figure of 6.88% [2].
The latter study states that the total cost of adverse events ... [represented] 15.7% of the
total expenditure on direct hospital costs, or an additional 18.6% of the total inpatient
hospital budget. Given these indicators, it is important that the usefulness of data mining
techniques in reducing artefacts such as adverse effects is explored.
A seminal example of data mining use within the hospital domain occurred during the Bristol Royal Infirmary inquiry of 2001 [3], in which data mining algorithms were used to create hypotheses regarding the excessive number of deaths among infants who underwent open-heart surgery at the Bristol Royal Infirmary. In a recent speech, Sir Ian Kennedy (who led the original inquiry) said, with respect to improving patient safety, that The [current] picture is one of pockets of activity but poor overall coordination and limited analysis and dissemination of any lessons. Every month that goes by in which bad, unsafe practice is not identified and rooted out and good practice shared, is a month in which more patients die or are harmed unnecessarily. The role of data mining within hospital analysis is important given the complexity and scale of the analysis to be undertaken. Data mining can provide solutions that can facilitate the benchmarking of patient safety provision, which will help eliminate variations in clinical practice, thus improving patient safety.
The Power Knowledge Builder (PKB) project provides a suite of data mining ca-
pabilities, tailored to this domain. The system aims to alert management to events or
items of interest in a timely manner either through automated exception reporting, or
through explicit exploratory analysis. The initial suite of algorithms (trend analysis, resource analysis, outlier detection and clustering) was selected as forming the core set of tools that could be used to perform data mining in a way that would be usable by educated users, without the requirement for sophisticated statistical knowledge.
To our knowledge, PKB’s goal is unique – it is industry specific and does not require
specialised data mining skills, but aims to leverage the data and skills that hospitals al-
ready have in place. There are other current data mining solutions, but they are typically part of more generic reporting solutions (e.g. Business Objects, Cognos) or subsets of data management suites such as SAS or SQL Server. These tools are frequently powerful

Figure 1. Prototype Application Snapshot (panels: Control Panel, Outlier Cluster Characterisation, Scatterplot Visualisation, Summary Table)

and flexible, but are not targeted to an industry, and to use them effectively requires a
greater understanding of statistics and data mining methods than our target market gen-
erally has available. This paper introduces the PKB suite and its components in Section
1. Section 2 discusses some of the important lessons learnt, while Section 3 presents the
current state of the project and the way forward.

1. The PKB Suite

The PKB suite is a core set of data mining tools that have been adapted to the patient
costing domain. The initial algorithm set (anomaly detection, trend analysis and resource analysis) was derived through discussion with practitioners, focusing upon potential ap-
plication and functional variation. Subsequently, clustering and characterisation algo-
rithms were appended to enhance usefulness. The current application prototype is writ-
ten in Java 1.4 for compatibility purposes and is multi-threaded to allow for multiple
concurrent analysis instances. The tool, an example of which is presented in Figure 1, is
feature rich, providing algorithms for a number of data mining tasks including clustering,
characterisation and outlier detection.
Each algorithmic component has an interface wrapper, which is subsequently incor-
porated within the PKB prototype. The interface wrapper provides effective textual and
graphical elements, with respect to pre-processing, analysis and presentation stages, that
simplifies both the use of PKB components and the interpretation of their results. This is


Figure 2. Outlier Analysis with Characterisation Tables and Selection

important as the intended users are hospital administrators, not data mining practitioners
and hence the tools must be usable by educated users, without requiring sophisticated
statistical knowledge.

1.1. Outlier and Cluster Analysis



Outlier (or anomaly) detection is a mature field of research with its origins in statistics
[4]. Current techniques typically incorporate an explicit distance metric, which deter-
mines the degree to which an object is classified as an outlier. A more contemporary
approach incorporates an implied distance metric, which alleviates the need for the pair-
wise comparison of objects [5,6] by using domain space quantisation to enable distance
comparisons to be made at a higher level of abstraction and, as a result, obviates the need
to recall raw data for comparison.
The PKB outlier detection algorithm, CURIO, contributes to the state of the art in outlier detection through novel quantisation and object allocation, which enable the discovery of outliers in large disk-resident datasets in two sequential scans [7]. Furthermore, CURIO addresses a need realised during this project: discovering not only individual outliers but also outlier clusters. By clustering similar (close) outliers and presenting cluster characteristics it becomes easier for users to understand the common traits of similar outliers, assisting the identification of outlier causality. An outlier analysis
instance is presented in Figure 2, showing the interactive scatterplot matrix and cluster
summarisation tables.


Figure 3. Example Grid        Figure 4. With increased precision

Outlier detection has the potential to find anomalous information that is otherwise
lost in the noise of multiple variables. Hospitals are used to finding (and in fact expect
to see) outliers in terms of cost of care, length of stay etc. for a given patient cohort.
What they are not so used to finding are outliers over more than two dimensions, which
can provide new insights into the hospital activities. The outlier component presents pre-
processing and result interfaces, incorporating effective interactive visualisations that
enable the user to explore the result set, and see common traits of outlier clusters through
characterisation.
Given CURIO's cluster-based foundations, clustering (a secondary component) is a variation of CURIO that finds the common clusters rather than the anomalous ones. Given their common basis, both the outlier and clustering components require the same parameters and use the same type of result presentations. Although proposed as an area of further work by Knorr [8], the realisation of this outlier-clustering functionality is novel and enhances the utility of the CURIO algorithm [7].

Given the simple premise that outliers are distant from the majority of objects when
represented in Euclidean space, if this κ-D space is quantised, outliers are those objects
in relatively sparse cells, where the degree of relative sparsity is dictated by some tol-
erance T . Given T = 4, Figure 3 presents a 2-dimensional grid illustrating both poten-
tial (grey) and valid (white labelled) outlier objects. However this simple approach can
validate false positives as indicated by A, which is on the edge of a dense region. This
problem can be resolved by either creating multiple offset grids or by undertaking a NN (Nearest Neighbour) search. The multiple offset grids approach requires the instantiation of many grids which are slightly offset from each other. This effectively alters the cell allocation of objects, and requires a subsequent voting system to determine if an object is to be regarded
as an outlier. An alternative NN search explores the bounding cells of each identified po-
tential outlier cell and if the number of objects within this neighbourhood exceeds T , all
objects residing within the cell are eliminated from consideration. Both techniques were
investigated and the neighborhood search was found to be more accurate, and hence is
the one presented.
This overarching theory provides the foundation of CURIO, enabling disk resident
datasets to be analysed in two sequential scans. The quantisation and subsequent count


based validation effectively discovers outliers, indicating that an explicit distance threshold is not required and in fact often needlessly complicates the discovery process. CURIO incorporates an implied distance metric through cellular granularity, where the finer the granularity the shorter the implied distance threshold. The precision parameter P effectively quantises each dimension into 2^P equal-length intervals. For example, Figure 4 illustrates the effect of increasing P by 1 (effectively doubling the number of partitions per dimension), resulting in potentially more outliers.
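A simplified, in-memory reading of this count-based scheme (not the published two-scan, disk-resident CURIO algorithm); parameter values and data are placeholders:

```python
import numpy as np
from collections import Counter
from itertools import product

def grid_outliers(data, precision=4, tolerance=4):
    """Quantise each dimension into 2**precision cells and flag objects in cells
    whose neighbourhood (the cell plus its bounding cells) holds <= tolerance objects."""
    data = np.asarray(data, dtype=float)
    lo, hi = data.min(axis=0), data.max(axis=0)
    scaled = (data - lo) / np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum((scaled * 2 ** precision).astype(int), 2 ** precision - 1)
    counts = Counter(map(tuple, cells))

    outliers = []
    for idx, cell in enumerate(map(tuple, cells)):
        if counts[cell] > tolerance:
            continue                      # dense cell: not a potential outlier
        offsets = product(*([(-1, 0, 1)] * data.shape[1]))
        neighbourhood = sum(counts.get(tuple(np.add(cell, o)), 0) for o in offsets)
        if neighbourhood <= tolerance:    # NN check against bounding cells
            outliers.append(idx)
    return outliers

rng = np.random.default_rng(0)            # two dense blobs plus scattered points
pts = np.vstack([rng.normal(0.0, 0.05, (200, 2)),
                 rng.normal(1.0, 0.05, (200, 2)),
                 rng.uniform(-1.0, 2.0, (5, 2))])
print(grid_outliers(pts))
```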

1.2. Characterisation

Characterisation allows users to understand more deeply the nature of their data. For
example, a cluster of high-cost cases may be found on a certain day and characterisation
allows users to investigate why this phenomenon occurs – perhaps there is some latency
associated with weekend admissions.
Characterisation (also a secondary component) was initially developed as a sub-
sidiary for outlier and clustering analysis in order to present descriptive summaries of
the clusters to the users. However it is also present as an independent tool within the
suite. The characterisation algorithm provides this descriptive cluster summary by find-
ing the sets of commonly co-occurring attribute values within the set of cluster objects.
To achieve this, a partial inferencing engine, similar to those used in association min-
ing [9] is used. The engine uses the extent of an attribute value’s (elements) occurrence
within the dataset to determine its significance and subsequently its usefulness for sum-
marisation purposes. Once the valid elements have been identified, the algorithm deep-
ens finding progressively larger, frequently co-occurring elements sets from within the
dataset.
Given the target of presenting summarised information about a cluster, the valid elements are those that occur often within the source dataset. While this works well for non-ordinal (and other low-cardinality) data, ordinal data requires partitioning into ranges so that a significant mass can be achieved. This is accomplished by progressively reducing the number of partitions until at least one achieves a significant volume. Given the range 1 to 100, an initial set of 2^6 partitions is formed; if no partition is valid, each pair of partitions is merged by removing the least significant bit (giving 2^5 partitions). This process continues until a significant mass is reached. This functionality is illustrated in Figure 2 through the presentation of a summarisation table with ordinal ranges.
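A minimal sketch of the level-wise search for frequently co-occurring attribute values described above (the ordinal-range handling is omitted); the support threshold and toy records are illustrative:

```python
from collections import Counter

def characterise(records, min_support=0.5):
    """Find sets of attribute=value pairs co-occurring in at least
    min_support of the cluster's records (a simplified level-wise pass)."""
    n = len(records)
    rows = [frozenset(rec.items()) for rec in records]
    counts = Counter(pair for row in rows for pair in row)
    current = {frozenset([p]) for p, c in counts.items() if c / n >= min_support}
    result, k = set(current), 2
    while current:
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = {c for c in candidates
                   if sum(1 for row in rows if c <= row) / n >= min_support}
        result |= current
        k += 1
    return result

cluster = [{"day": "Saturday", "ward": "A", "cost": "high"},
           {"day": "Saturday", "ward": "A", "cost": "high"},
           {"day": "Sunday",   "ward": "A", "cost": "high"},
           {"day": "Saturday", "ward": "B", "cost": "low"}]
print(characterise(cluster))
```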

1.3. Resource Analysis

Resource usage analysis is a domain specific application that provides a tool that analyses
patterns of resource use for patient episodes (hospital stays). This novel algorithm is
backed by an extended inferencing engine [9], that provides dynamic numeric range
partitioning, temporal semantics and affiliated attribute quantisation, to provide a rich
analysis tool. Furthermore the tool enables the association of extraneous variables with
the resource patterns such as average cost and frequency. The resource usage results
are presented as a set of sortable tables, where each table relates to a specified dataset
partition. For example, the user can specify the derivation of daily resource usage patterns
for all customers with a particular Diagnosis Related Group, partitioned by consulting
doctor. By associating average cost and frequency with these patterns, useful information


Figure 5. Resource Analysis.

regarding the comparative cost effectiveness of various doctors may be forthcoming. A screenshot of the resource analysis presentation is provided in Figure 5, illustrating the
clustering of daily resource use for consulting doctor id 1884. Each row represents a
cluster of resources used. Further questions such as is it significant that a chest x-ray for
one patient took place on day one, while for another patient in the same cohort it took
place on day two? can also be addressed. The resource analysis component automates
and simplifies what would have previously been very complex tasks for costing analysts
to perform.

1.4. Trend Analysis

Trend or time-series analysis is a comparative analysis of collections of observations made sequentially in time. Based upon previous research [10,11], the underlying analysis engine undertakes similarity-based time series analysis using Minkowski metrics, removing distortion through offset translation, amplitude scaling and linear trend removal. The component provides a rich higher-level functional set incorporating temporal variation, the identification of offset and subset trends, and the identification of dissimilar as well as similar trends, as illustrated in Figure 6. The component provides a comprehensive interface including ordered graphical pair-wise representations of results for comparison purposes. For example, the detection of an offset trend between two wards with respect to admissions can indicate a causality that requires further investigation.
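The distortion removal and Minkowski comparison can be sketched as follows; z-normalisation and least-squares detrending are common choices assumed here for illustration, and the two admissions series are made up:

```python
import numpy as np

def detrend(series):
    """Remove a linear trend by subtracting the least-squares fit."""
    s = np.asarray(series, dtype=float)
    x = np.arange(len(s))
    slope, intercept = np.polyfit(x, s, 1)
    return s - (slope * x + intercept)

def z_normalise(series):
    """Remove offset and amplitude distortion before comparison."""
    s = np.asarray(series, dtype=float)
    sd = s.std()
    return (s - s.mean()) / (sd if sd > 0 else 1.0)

def trend_distance(a, b, p=2):
    """Minkowski distance between two series after offset translation,
    amplitude scaling and linear trend removal; smaller means more similar."""
    a_clean, b_clean = z_normalise(detrend(a)), z_normalise(detrend(b))
    return np.sum(np.abs(a_clean - b_clean) ** p) ** (1.0 / p)

ward_a = np.array([12, 14, 15, 18, 22, 25, 24, 27])   # weekly admissions (made up)
ward_b = np.array([30, 33, 34, 38, 45, 50, 48, 55])
print(trend_distance(ward_a, ward_b))
```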


Figure 6. Trend Analysis Presentation: parameter setting

2. Lessons Learned

The PKB project began as a general investigation into the application of data mining techniques to patient costing software, between Flinders University and PowerHealth Solutions, providing both an academic and an industry perspective. Now, 18 months on from its inception, many lessons have been learnt that will hopefully aid both parties in future interactions with each other and with other partners. From an academic viewpoint, issues relating to the establishment of beta test sites and the bullet-proofing of code are unusual, while from an industry viewpoint the meandering nature of research and the potential for somewhat tangential results can be frustrating. Overall, three main lessons have been learnt.

Solution looking for a problem? It is clear that understanding data and deriving usable
information and insights from it is a problem in hospitals, but how best to use the
research and tools is not always clear. In particular, the initial project specification
was unclear as to how this would be achieved. As the project evolves it is crys-
tallising into a tool suite that complements PowerHealth Solution’s current report-
ing solution. More focus upon the application of the PKB suite from the outset
would have sped up research but may also have constrained the solutions found.
Educating practitioners. The practical barriers to data mining reside more in the struc-
turing and understanding of the source data than in the algorithms themselves. A
significant difficulty in providing data mining capabilities to non-experts is the re-


quirement for the users to be able to collect and format source data into a usable
format. Given knowledge of the source data, scripts can easily be established to ac-
complish the collection process. However where the user requires novel analysis,
an understanding of the required source data is required. It is possible to abstract
the algorithmic issues away from the user, providing user-friendly GUIs for re-
sult interpretation and parameter specification, however this is difficult to achieve
for source data specification, as the user must have a level of understanding with
respect to the nature of the required analysis in order to adequately specify it.
Pragmatics. The evaluation of the developed tools requires considerable analysis, from
both in-house analysts and analysts from third parties who have an interest in the
PKB project. The suite is theoretically of benefit, with many envisaged scenarios
(based upon experience) where it can deliver useful results, but it is difficult to find
beta sites with available resources.

3. Current State and Further Work

The second version of the PKB suite is now at beta-test stage, with validation and further
functional refinement required from industry partners. The suite currently consists of a
set of fast algorithms with relevant interfaces that do not require special knowledge to
use. Of importance in this stage is feedback regarding the collection and pre-processing
stages of analysis and how the suite can be further refined to facilitate practitioners in
undertaking this.
The economic benefits of the suite are yet to be quantified. Expected areas of benefit
are in the domain of quality of care and resource management. Focusing upon critical
indicators, such as death rates and morbidity codes, in combination with multiple other
dimensions (e.g. location, carer, casemix and demographic dimensions) has the potential
to identify unrealised quality issues.
Three immediate areas of further work are evident: the inclusion of extraneous repositories, knowledge base construction and textual data mining. The incorporation of extraneous repositories, such as meteorological and socio-economic data, within some analysis routines can provide useful information regarding causality. The incorporation of an evolving knowledge base will facilitate analysis by either eliminating known information from result sets or flagging critical artefacts. Finally, as most hospital data is not structured, but contained in notes, descriptions and narrative, the mining of textual information will also be valuable.

References

[1] Wolff, A.M., Bourke, J., Campbell, I.A., Leembruggen, D.W.: Detecting and reducing hospital adverse
events: outcomes of the wimmera clinical risk management program. Medical Journal of Australia 174
(2001) 621–625
[2] Ehsani, J.P., Jackson, T., Duckett, S.J.: The incidence and cost of adverse events in victorian hospitals
2003-04. Medical Journal of Australia 184 (2006) 551–555
[3] Kennedy, I.: Learning from Bristol: The report of the public inquiry into children’s heart surgery at the
Bristol Royal Infirmary 1984-1995. Final report, COI Communications (2001)
[4] Markou, M., Singh, S.: Novelty detection: a review - part 1: statistical approaches. Signal Processing
83 (2003) 2481–2497


[5] Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large datasets. In Gupta, A.,
Shmueli, O., Widom, J., eds.: 24th International Conference on Very Large Data Bases, VLDB’98, New
York, NY, USA, Morgan Kaufmann (1998) 392–403
[6] Papadimitriou, S., Kitagawa, H., Gibbons, P., Faloutsos, C.: LOCI: Fast outlier detection using the local
correlation integral. In: 19th International Conference on Data Engineering (ICDE), Bangalore (2003)
315–326
[7] Ceglar, A., Roddick, J.F., Powers, D.M.: CURIO: A fast outlier clustering algorithm for large datasets.
In Ong, K.L., Li, W., Gao, J., eds.: Second International Workshop on Integrating AI and Data Mining
(AIDM 2007). Volume 84 of CRPIT., Gold Coast, Australia, ACS (2007) 37–45
[8] Knorr, E.: Outliers and Data Mining: Finding Exceptions in Data. PhD thesis, University of British
Columbia (2002)
[9] Ceglar, A., Roddick, J.F.: Association mining. ACM Computing Surveys (2006)
[10] Brockwell, P.J., Davis, R.A.: Time Series: Theory and Methods. Springer, NY (1987)
[11] Keogh, E., Chakrabarti, K., Mehrotra, S., Pazzani, M.: Locally adaptive dimensionality reduction for indexing large time series databases. In: ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA (2001) 151–162

Part 3
Data Mining Applications of Tomorrow

Clustering of Adolescent Criminal Offenders using Psychological and Criminological Profiles
Markus BREITENBACH 1 , Tim BRENNAN, William DIETERICH and
Greg GRUDIC

Abstract. In criminology research the question arises whether certain types of delinquents can be identified from data, and while there are many cases that cannot be clearly labeled, overlapping taxonomies have been proposed in [1,2,3]. In a recent study, juvenile offenders (N = 1572) from three state systems were assessed on a battery of criminogenic risk and needs factors and their official criminal histories. Cluster analysis methods were applied. One problem we encountered is the large number of hybrid cases that appear to belong to two or more classes. To eliminate these cases we propose a method that combines the results of Bagged K-Means and the consistency method [4], a semi-supervised learning technique. A manual interpretation of the results showed very interpretable patterns that were linked to existing criminological research.

Introduction

Unsupervised clustering has been applied successfully in many applied disciplines to group cases on the basis of similarity across sets of domain-specific features. A typical analytical sequence in the data mining process is to first identify clusters in the data, assess their robustness and interpret them, and later train a classifier to assign new cases to the respective clusters.
The present study applies several unsupervised clustering techniques to a highly disputed area in criminology, i.e. the existence of criminal offender types. Many contemporary criminologists argue against the possibility of separate criminal types [5] while others strongly support their existence (see [2,6]). Relatively few studies in criminology
have used Data Mining techniques to identify patterns from data and to examine the ex-
istence of criminal types. To date, the available studies have typically used inadequate
cross verification techniques, small and inadequate samples and have produced incon-
sistent or incomplete findings, so that it is often difficult to reconcile the results across
these studies. Often, the claims for the existence of “criminal types” have emerged from
psychological or social theories that mostly lack empirical verification. Several attempts
have been made to integrate findings from available classification studies. While these
efforts have suggested some potential replications of certain offender types they have
1 Corresponding Author: Department of Computer Science, University of Colorado at Boulder, 430 UCB, Boulder, CO 80309-0430, U.S.A.; E-mail: [email protected]



been limited by their failure to provide clear classification rules e.g. a psychopathic cat-
egory has emerged from a large clinical literature but there is much dispute over how to
identify them, what specific social and psychological causal factors are critical, whether
or not this type exists among female offenders or among adolescents and whether there
are “sub-types” of psychopaths. Thus, a current major challenge in criminology is to ad-
dress whether reliable patterns or types of criminal offenders can be identified using data
mining techniques and whether these may replicate the criminal profiles as described in
the prior criminological literature.
In the present study Juvenile offenders (N = 1572) from three U.S. state juvenile
justice systems were assessed on a battery of criminogenic risk and needs factors as well
as official criminal histories. Data mining techniques were applied with the goal of iden-
tifying intrinsic patterns in this data set and to assess whether these replicate any of the
main patterns previously proposed in the criminological literature [6]. The present study
thus aimed to identify empirical patterns within this juvenile justice population and to
examine how they relate to certain theorized patterns from the prior criminological lit-
erature. The implications of these findings for Criminology are manifold. The findings
firstly suggest that certain offender patterns can be reliably identified using several data
mining unsupervised clustering techniques. Secondly, the findings appear to offer a chal-
lenge to those criminological theorists who hold that there is only one general “global
explanation" of criminality as opposed to multiple pathways with different explanatory
models (see [7]).
From a methodological perspective the present paper illustrates some of the diffi-
cult analytical problems encountered in applied criminological research that mainly stem
from the kind of data encountered in this field. A first major problem is that the data is
noisy and often unreliable. Second, the empirical clusters are not clear cut so that cases
range from strongly classified to poorly classified boundary cases with only weak cluster
affiliations. Certain cases may best be seen as hybrids (close to cluster boundaries) or
outliers. Additionally, some distortion of clusters can be a problem since many cluster-
ing algorithms assign a label to every point in the data, including outliers. Such “forc-
ing” of ill-fitting members - both hybrid and outlier cases - may distort the quality and
interpretation of the clustering results. Standard methods such as K-Means will assign
cases to the closest cluster center no matter how “far away” from the cluster centers
the points are. While other algorithms such as EM-Clustering [8] output probabilities of
class-membership, the elimination of outliers in an unsupervised setting is a hard problem
in this area of applied research. In this context we acknowledge that much work has been
done to make clustering more robust against outliers, such as using clustering ensembles
[9,10] or combining the results of different clustering methods [11], but a clear challenge
is to develop effective methods to eliminate cases aggressively in order to obtain refined clustering solutions, i.e., to remove points that are not "close enough" to the cluster center.
Thus, in this research we demonstrate a methodology to identify well-clustered cases. Specifically, we combined a semi-supervised technique with an initial standard
clustering solution. This process identified several highly replicated offender types with
clear definitions of reliable and core criminal patterns. Of substantive criminological in-
terest we found that these replicated clusters provided social and psychological profiles
that have a strong resemblance to certain of the criminal types proposed in prior litera-
ture by leading criminologists [6,2]. However, the present findings go beyond these prior

Data Mining for Business Applications, edited by C. Soares, and R. Ghani, IOS Press, Incorporated, 2010. ProQuest Ebook Central,
http://ebookcentral.proquest.com/lib/tromsoub-ebooks/detail.action?docID=647889.
Created from tromsoub-ebooks on 2024-08-11 18:34:23.
M. Breitenbach et al. / Clustering of Adolescent Criminal Offenders 125

typological proposals, firstly, by grounding the type descriptions in clearly defined empirical patterns and, secondly, by providing explicit classification rules for each offender type, rules that have been generally absent from the prior criminological literature.

1. Method

Juvenile offenders (N = 1572) from three state systems were assessed on a battery of
criminogenic risk and needs factors using the Youth COMPAS assessment instrument
described in [12] and their official criminal histories. The scales assess various areas of risk and desistance such as the youths' relationship with their family, school, substance abuse, aggressive behaviors, abuse, socio-economic situation and social factors.
We started with a provisional initial solution obtained “manually” using standard
K-means and Ward’s minimum-variance method. These approaches have been the pre-
ferred choice in many social and psychological studies to find hidden or latent typologi-
cal structure in data [13,14].
Despite its success, standard K-means is vulnerable to data that do not conform to the minimum-variance assumption or that exhibit a manifold structure, that is, regions (clusters) that may wind or straggle across a high-dimensional space. The initial K-means clusters
were also vulnerable to remaining outliers or noise in the data. Thus, we proceeded with
two additional methods designed to deal more effectively with these outlier and noise
problems.

1.1. Bagged K-Means

Bagging has been used with success for many classification and regression tasks [15]. In the context of clustering, bagging generates multiple models from bootstrap replicates of the selected training set and then integrates these into one final aggregated
model. By using only two-thirds of the training set (with some cases repeated) to create
each model, we aimed to achieve models that should be fairly uncorrelated so that the
final aggregated model may be more robust to noise or any remaining outliers inherent
in the training set.


In [16] a method that combined Bagging and K-means clustering was introduced. In
our analyses we used the K-means implementation in R [17]. We generated 1000 random
bags from our initial sample of 1,572 cases with no outliers removed to obtain cluster
solutions for each bag. The centers of these bags were then treated as data points and re-
clustered with K-means. The final run of this K-means was first seeded with the centers
from our Ward solution (described above), which was then tested against one obtained
with randomly initialized centers. These resulted in the same solution, suggesting that
initializing the centers in these ways did not unduly bias K-means convergence. The
resulting stable labels were then used as our final centers for the total dataset and in the
voting procedure outlined below.
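To make the bagged K-means procedure concrete, a minimal sketch in Python is given below. It assumes scikit-learn and NumPy; the original analyses reported here were run in R [17], so the function name, bag construction and parameter values are illustrative assumptions rather than the original implementation.

# Minimal sketch of bagged K-means, assuming scikit-learn (illustrative only;
# the original analysis used the K-means implementation in R [17]).
import numpy as np
from sklearn.cluster import KMeans

def bagged_kmeans(X, k, n_bags=1000, bag_frac=2/3, seed_centers=None, random_state=0):
    """X: (n, m) numpy array. Cluster bootstrap bags of X, then re-cluster the
    resulting bag centers; optionally seed the final run with existing centers
    (e.g. a Ward solution)."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    centers = []
    for _ in range(n_bags):
        # Draw a bag with replacement (roughly two-thirds of the cases, some repeated).
        idx = rng.choice(n, size=int(bag_frac * n), replace=True)
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[idx])
        centers.append(km.cluster_centers_)
    centers = np.vstack(centers)
    # Treat the bag centers as data points and re-cluster them.
    if seed_centers is not None:
        final = KMeans(n_clusters=k, init=seed_centers, n_init=1).fit(centers)
    else:
        final = KMeans(n_clusters=k, n_init=10, random_state=0).fit(centers)
    # Assign every original case to its nearest final center.
    labels = final.predict(X)
    return final.cluster_centers_, labels

A second call with randomly initialized centers can then be compared to the seeded run, mirroring the check described above that the seeding did not unduly bias convergence.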

1.2. Semi-Supervised Clustering

We will now briefly summarize the semi-supervised labeling method proposed in [4].
Given a set of points $X \in \mathbb{R}^{n \times m}$ and labels $L = \{1, \cdots, c\}$, let $x_i$ denote the $i$th example. Without loss of generality the first $l$ points $(1 \cdots l)$ are labeled and the remaining


points $(l+1 \cdots n)$ unlabeled. Define $Y \in \mathbb{N}^{n \times c}$ with $Y_{ij} = 1$ if point $x_i$ has label $j$ and 0 otherwise. Let $\mathcal{F} \subset \mathbb{R}^{n \times c}$ denote all the matrices with nonnegative entries. A matrix $F \in \mathcal{F}$ labels each point $x_i$ with the label $y_i = \arg\max_{j \le c} F_{ij}$. Define the series $F(t+1) = \alpha S F(t) + (1-\alpha) Y$ with $F(0) = Y$, $\alpha \in (0,1)$. The entire algorithm is defined as follows:
1. Form the affinity matrix $W_{ij} = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))$ if $i \neq j$ and 0 otherwise. $\sigma$ determines how fast the distance function decays.
2. Compute $S = D^{-1/2} W D^{-1/2}$ with $D_{ii} = \sum_{j=1}^{n} W_{ij}$ and $D_{ij} = 0$ for $i \neq j$.
3. Compute the limit of the series, $\lim_{t \to \infty} F(t) = F^* = (I - \alpha S)^{-1} Y$. $\alpha \in (0,1)$ limits how much the information spreads from one point to the other.
4. Label each point $x_i$ as $\arg\max_{j \le c} F^*_{ij}$.
The regularization framework for this method follows. The cost function associated with the matrix $F$ with regularization parameter $\mu > 0$ is defined as

$$ Q(F) = \frac{1}{2} \left( \sum_{i,j=1}^{n} W_{ij} \left\| \frac{1}{\sqrt{D_{ii}}} F_i - \frac{1}{\sqrt{D_{jj}}} F_j \right\|^2 + \mu \sum_{i=1}^{n} \| F_i - Y_i \|^2 \right) \qquad (1) $$

The first term is the smoothness constraint that associates a cost with change between nearby points. The second term, weighted by $\mu$, is the fitting constraint that associates a cost for change from the initial assignments. The classifying function is defined as $F^* = \arg\min_{F \in \mathcal{F}} Q(F)$. Differentiating $Q(F)$ and setting the derivative to zero one obtains $F^* - \frac{1}{1+\mu} S F^* - \frac{\mu}{1+\mu} Y = 0$. Defining $\alpha = \frac{1}{1+\mu}$ and $\beta = \frac{\mu}{1+\mu}$ (note that $\alpha + \beta = 1$ and the matrix $(I - \alpha S)$ is non-singular) one can obtain

$$ F^* = \beta (I - \alpha S)^{-1} Y \qquad (2) $$

For a more in-depth discussion about the regularization framework and how to obtain the closed form expression $F^*$ see [4].


An unlabeled point is assigned to the class with the highest value in its row of $F^*$, an $n \times c$ matrix. Note that the label assignment for each point depends on the initially marked points chosen and on the parameters $\sigma$ and $\alpha$. In most cases one of the columns of $F^*$ is significantly larger than the other values for this point, indicating a clear vote for one class. Since the assignment depends on the parameters chosen, and it is not obvious how to choose them, we obtained several sets of labels by varying $\sigma$, which defines the local neighborhood of a point.
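A minimal sketch of the consistency method, written directly from steps 1-4 above, is given below in Python/NumPy. The function name and the choice of labeled points are illustrative assumptions; in the analysis described here the labeled points would be tied to the bagged K-means solution (e.g., the cases closest to its centers).

# Sketch of the consistency (label-spreading) method of Zhou et al. [4].
import numpy as np

def consistency_labels(X, labeled_idx, labeled_y, n_classes, sigma=1.0, alpha=0.9):
    """X: (n, m) array; labeled_idx/labeled_y: indices and class ids (0..n_classes-1)
    of the initially labeled points. Returns predicted labels and the matrix F*."""
    n = X.shape[0]
    # Step 1: affinity matrix W with zero diagonal.
    sq_norms = (X ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    W = np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Initial label matrix Y.
    Y = np.zeros((n, n_classes))
    Y[labeled_idx, labeled_y] = 1.0
    # Steps 3-4: closed-form limit F* = (I - alpha S)^{-1} Y (a constant factor
    # does not change the argmax), then vote by row-wise maximum.
    F_star = np.linalg.solve(np.eye(n) - alpha * S, Y)
    return F_star.argmax(axis=1), F_star

Running this function with several values of sigma yields the multiple label sets used in the voting procedure of the next subsection.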

1.3. Obtaining a Refined Solution: Consensus Cases and Voting Procedure

To tackle the problem of hybrid case elimination we use a voting methodology, similar to [11] which combines hierarchical and partitioning clusterings, to eliminate cases on which different algorithms disagree.
In this paper we adopted the following solution: First, we use Bagged K-Means
[16] to get a stable estimate of our cluster centers in the presence of outliers and hybrid
cases. To eliminate cases that are far away from the cluster centers, we use the centers


[Figure 1 panels: True labels; Consistency Method F*; Nearest Neighbor; Consistency Iterations t = 1, 5, 10, 20, 50, 150.]
Figure 1. Consistency Method: two labeled points per class (big stars) are used to label the remaining unla-
beled points with respect to the underlying cluster structure. F ∗ denotes the convergence of the series.

[Figure 2 panels: KMeans; Consistency alpha=0.9; Consistency alpha=0.8; KMeans + Consistency (Agreement), with hybrid cases marked.]
Figure 2. Toy example: Three Gaussians with hybrid cases in between them. Combining the labels assigned
by K-Means (top, left) and the Consistency Method (top, right; bottom, left) with two different σ results in the
removal of most of the hybrid cases (bottom, right) by requiring consensus between all models built.


in a semi-supervised setting with the consistency method [4] to obtain a second set of
labels. These labels from the semi-supervised method are obtained with a completely
different similarity measure than the K-Means labels. K-Means assigns labels by using
the distance to the cluster center (Nearest Neighbor) and works best given clusters that
are Gaussian. The semi-supervised consistency method assigns labels with respect to
the underlying intrinsic structure of the data and follows the shape of the cluster. The
semi-supervised labeling method minimizes (1) while K-Means attempts to minimize
the intra-cluster variance over all clusters $C_i$ with respective cluster means $\mu_i$, i.e.

$$ V = \sum_{i=1}^{k} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2 \qquad (3) $$

These two fundamentally different methods of label assignments are more likely to dis-
agree the farther away the point is from the cluster center. We eliminate cases in which
the labels do not agree. Note that the consistency method has been demonstrated to work
well on high-dimensional data such as images. On the other hand it has been demon-
strated that assignments of labels using Nearest Neighbor in high dimensional spaces are
often unusable [18].
The process is illustrated in Figure (2) with a toy example consisting of three Gaus-
sians and a couple of hybrid cases placed in between. For the purposes of this discussion
we labeled the five different groups of points. The three clusters are labeled as 1, 3 and
5. The hybrid cases are labeled as 2 and 4. We can see that the labeling resulting from
K-Means (upper left plot) and the consistency method differ (upper right, lower left).
The final voting solution (lower right) identifies hybrid cases that can then be removed 2 .
Using the method outlined above results in roughly half the cases of our data being
eliminated. The stability of these central core cases - as retained in the consensus model
- is shown by the almost identical matching of these core cases between the consensus
model and the bagged K-means solution (κ = .992, η = .994) and also to the original
K-means (κ = 0.949, η = 0.947).
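In code, the voting step itself reduces to a simple agreement test between the K-means labels and the consistency-method labelings obtained under different sigma values. The sketch below (Python/NumPy; function and variable names are illustrative assumptions) keeps only the consensus cases.

# Sketch of the consensus (voting) step: keep only cases on which the
# nearest-center labels and every consistency-method labeling agree.
import numpy as np

def consensus_filter(kmeans_labels, consistency_label_sets):
    """kmeans_labels: (n,) array; consistency_label_sets: list of (n,) arrays
    obtained with different sigma values. Returns a boolean mask of core cases."""
    core = np.ones_like(kmeans_labels, dtype=bool)
    for labels in consistency_label_sets:
        core &= (labels == kmeans_labels)
    return core

# Hypothetical usage: core_mask = consensus_filter(km_labels, [cons_labels_a, cons_labels_b])
# X_core, y_core = X[core_mask], km_labels[core_mask]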

2. Results

The clusters identified were Internalizing Youth A [2,19,20], Socialized Delinquents [21,22,23], Versatile Offenders [2], Normal Accidental Delinquents [1], Internalizing
Youth B [2], Low-control Versatile Offenders [2,3] and Normative Delinquency [24]. All
the clusters relate to types that have been previously identified in various studies in the
Criminology literature, but were never identified at the same time in one data set using
clustering. A more detailed discussion of these types from a criminological point of view
can be found in [12]. A graphical representation of each cluster profile is illustrated in
Figure (3) showing the Z-Score centered around the mean.

2 A color version of this figure for easier viewing is available at the following URL: http://markus-breitenbach.com/figures/ecml_fig2.jpg


[Figure 3: cluster profiles (Z-scores) for classes 1-7 (n = 83, 103, 85, 151, 197, 146, 130) over the 32 input scales: FamCrime, SubTrbl, Impulsiv, ParentConf, ComDrug, Aggress, ViolTol, PhysAbuse, PoorSuper, Neglect, AttProbs, SchoolBeh, InconDiscp, EmotSupp, CrimAssoc, YouthRebel, LowSES, FamilyDisc, NegCognit, Manipulate, HardDrug, SocIsolate, CrimOpp, EmotBonds, LowEmpath, Nhood, LowRemor, Promiscty, SexAbuse, LowProsoc, AcadFail, LowGoals.]

Figure 3. Resulting Cluster Means: Mean Plots of External Criminal History Measures Across Classes from
the Core Consensus Solution with Bootstrapped 95% Confidence Limits.

Cluster 1. Internalizing Youth A: Withdrawn, Abused and Rejected. This cluster is dominated by extreme family abuse and an internalizing pattern of social withdrawal, hostility and suspicion. These youths come from very poor families (LowSES) that are highly disorganized (FamilyDiscontinuity) and have a history of high crime/drug
(FamilyCrime). There is also a history of serious abuse/neglect and an internalizing, withdrawn personality due to extreme physical abuse, sexual abuse, lack of emotional support, neglect, weak discipline and poor supervision. This abuse is accompanied by serious social isolation/withdrawal, negative social cognitions/mistrust and a hostile attitude to others. This cluster is similar to the types described in the studies of [2,19,20].
Cluster 2. Socially Deprived: Sub-cultural or Socialized Delinquents. This cluster ap-
pears to replicate the “lower class” or “socialized” delinquent often described in the so-
ciological literature [21,22,23]. A prototypical example is the “common sociopath” de-
scribed in [2]. The youth are from lower socio-economic families (LowSES) with above
average family disorganization (FamilyDiscontinuity). School performance is relatively
poor (AcadFail). This type has little evidence of sexual or physical abuse and is not in
rebellion against the parents. They do not show social withdrawal, hostile negative social attributions or aggression to others, and show no marked low-control personality features.
They have lower than average involvement in drugs or promiscuity.
Cluster 3. Low-Control A: Versatile Offenders. Cluster 3 matches “Primary Psy-
chopath” in [2] with high scores for impulsivity, low empathy, hostility, manipulative-

Data Mining for Business Applications, edited by C. Soares, and R. Ghani, IOS Press, Incorporated, 2010. ProQuest Ebook Central,
http://ebookcentral.proquest.com/lib/tromsoub-ebooks/detail.action?docID=647889.
Created from tromsoub-ebooks on 2024-08-11 18:34:23.
130 M. Breitenbach et al. / Clustering of Adolescent Criminal Offenders

dominance, low remorse, criminal peers, high risk lifestyle, drug abuse and serious crim-
inal history. This type has little evidence of sexual or physical abuse and is in rebellion
against their parents. School performance is relatively poor (AcadFail) along with atten-
tion problems (AttProbs), disruptive school behaviors (SchoolBeh) and the youth have
few pro-social activities after school (LowProsocial). This cluster’s official criminal his-
tory coheres with the above extreme profile. This cluster has the highest mean number
for both adjudications and detentions compared to all others.
Cluster 4. Normal “Accidental/Situational” Delinquents. We found two clusters of
broadly normal youth (Clusters 4 and 7). Cluster 4 reflects mostly “normal” youth with
few risk factors. This benign pattern, plus their late age at first adjudication and mostly
minor delinquency, appears to be a good match for the AL type described in [1]. This type
scores lower than average on all the scales. Their personality pattern shows no clear
tendency towards low self control.
Cluster 5. Internalizing Youth B: With Positive Parenting. Cluster 5 and Cluster 1 both exhibit the internalizing pattern of social withdrawal, isolation and mistrust. Both also avoid delinquent peers and drugs, have low adjudication rates, and arguably belong in a single large “internalizing” cluster. This cluster matches the “neurotic” offender category in [2].
This internalizing pattern (like Cluster 1) has above average negative social attributions
(NegCognit), hostile aggression (Aggress) and social withdrawal (SocIsolate). The so-
cial isolation is perhaps linked to a relatively low-risk lifestyle reflected by avoidance
of delinquent peers (CrimAssoc), common drugs (ComDrug), hard drugs (HardDrug)
and promiscuity. It profoundly differs from Cluster 2 by the presence of caring, competent and non-abusive parents who are not neglectful and who do not shirk their
supervision. These families give little evidence of serious disorganization (FamilyDisc)
and have lower than average family crime/drugs (FamCrime) and low parental conflict
(ParentConf).
Cluster 6. Low-control B: Early Onset, Versatile Offenders with Multiple Risk Factors. Cluster 6 is a more extreme variant of Cluster 3. This profile appears well matched to
“secondary psychopath” in [2] and “primary sociopath” in [3]. These youth score above
average on every scale. These youth follow a high risk lifestyle, associate with anti-social
peers and have the highest scores for soft drugs, hard drugs, drug related trouble (Sub-
Trbl) and promiscuity. Their personality shows above average impulsivity, manipulative-
dominance and tolerance of violence. At school they show disruptive school behavior,
attention problems, but only moderately above-average failure. Their families show pat-
terns of poor supervision and neglect, are in serious conflict with each other (ParentConf)
while the youth shows extreme rebellion against the parents (YouthRebel).
Cluster 7. Normative Delinquency: Drugs, Sex and Peers. This cluster, along with
Cluster 4, also reflects “normal” youth with substantial school and family strengths.
However, Cluster 7, unlike Cluster 4, has vulnerabilities to drugs (ComDrug, HardDrug,
SubTrbl), sex (Promiscty) and criminal peers (CrimAssoc). Their personality appears be-
nign with few signs of low self-control or social isolation. Their official record coheres
with this profile showing an older age at first arrest and mostly the normative deviance
that is widespread among most youth [24].


2.1. External Validation

External validation requires finding significant differences between clusters on external (but relevant) variables that were not used in cluster development. By comparing the
means and bootstrapped 95 percent confidence intervals of four external variables across
the seven clusters from the core consensus solution we identified those variables. The ex-
ternal variables include three criminal history variables (total adjudications, age-at-first
adjudication and total violent felony adjudications) and one demographic variable (age-
at-assessment). These plots show a number of significant differences in expected direc-
tions. For example, clusters 4 and 7, which both match the low risk profile of Moffitt’s
AL type [1] have significantly later age-at-first adjudication compared to the higher risk
cluster 6 that matches Moffitt’s high risk LCP and Lykken’s [2] Secondary Psychopath.
This latter cluster has the earliest age-at-first arrest and significantly higher total adjudi-
cations - which is consistent with Moffitt’s descriptions.
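As an illustration of how such comparisons can be computed, a minimal bootstrap sketch in Python/NumPy is given below; the function names and the number of bootstrap replicates are assumptions, not details taken from the original analysis.

# Sketch of the external-validation comparison: bootstrapped 95 percent
# confidence intervals of an external variable's mean within each cluster.
import numpy as np

def bootstrap_ci(values, n_boot=2000, alpha=0.05, random_state=0):
    """values: (n_c,) numpy array of one external variable within one cluster."""
    rng = np.random.default_rng(random_state)
    n = len(values)
    means = [rng.choice(values, size=n, replace=True).mean() for _ in range(n_boot)]
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return values.mean(), lo, hi

def cluster_cis(external_var, labels):
    """Return {cluster: (mean, lower, upper)} for one external variable."""
    return {c: bootstrap_ci(external_var[labels == c]) for c in np.unique(labels)}

Non-overlapping intervals between clusters (e.g. for age-at-first adjudication) then indicate the kind of significant differences described above.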
Finally, while our results indicate that boundary conditions of clusters are obviously
unreliable and fuzzy, the central tendencies or core membership appear quite stable. This
suggests that these high density regions contain sufficient taxonomic structure to sup-
port reliable identification of type membership for a substantial proportion of juvenile
offenders.
Using the method in Section 1.3 we were able to remove most of the hybrid cases. In
fact, the case removal was overly aggressive and removed roughly half the data set. How-
ever, the remaining cases were very interpretable on manual inspection and matched the
cluster profiles we had found previously. Our analyses also show that cluster boundaries are relatively unstable. The values of Kappa, which are between 0.55 and 0.70, although
indicating general overlap, also imply that boundaries between clusters may be imposed
differently, and cases close to boundaries may be unreliably classified across adjacent
clusters. Many of these cases may be regarded as hybrids with many co-occurring risk
or needs and multiple causal influences. Lykken [2] recognized this by stating that many
offenders will have mixed etiologies and will be borderline or hybrid cases (p. 21).
The presence of hybrids and outliers appears unavoidable given the multivariate
complexity of delinquent behavior, the probabilistic nature of most risk factors and mul-
tiplicity of causal factors. Additionally, our findings on boundary conditions and non-
classifiable cases must remain provisional since refinements to our measurement space
may reduce boundary problems. Specifically, it is known that the presence of noise and
non-discriminating variables can blur category boundaries [13]. Further research may
clarify the discriminating power of all classification variables (features) and gradually
converge on a reduced space of only the most powerful features.

2.2. Classification

Since we now have a labeled dataset, we examine how well a classifier is able to
discriminate between the classes we have identified. We use a linear support vector ma-
chine in a one-against-one setting to build classifiers for the seven classes in the data set.
We use 10-fold cross-validation and random 90%/10% split to determine an estimate
of the error for future unseen data. In Table 1 we can see that a linear SVM can easily
discriminate between the classes with more than 90 percent accuracy.


Table 1. Cross validation results with a linear support vector machine on the COMPAS data (93.6387% correct using cross-validation; 92% correct using 90/10 split estimates).
a b c d e f g ← classified as
133 2 1 0 6 2 0 a=1
3 248 4 3 3 3 2 b=2
0 4 224 0 1 3 3 c=3
0 7 0 248 3 0 6 d=4
4 5 3 1 268 0 7 e=5
1 2 3 0 0 171 0 f=6
0 2 3 4 8 1 180 g=7

2.3. Replication

To verify our findings and ensure that the clusters we have found are not artefacts of our
sample, we used a cross-replication and cluster validation design proposed by McIntyre
and Blashfield [25,26]. This method requires that one repeats the original analysis on a
replication sample (B) using identical methods. Furthermore, the cases of the replication
sample are assigned to the clusters of the original sample using a classification procedure.
The similarity of the two assignments is then compared in cross-tabulations.
For our purposes we used Support Vector Machine [27] models trained on the labels obtained for the original sample, one in a 1-vs-all setting along with a second model trained in a 1-vs-1 setting. Both models have equally good classification performance,
and in order to avoid erroneous assignments we require that both models agree on the
class.
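The assignment rule can be sketched as follows in Python with scikit-learn; the choice of LinearSVC for the 1-vs-all model and the use of -1 to mark unclassified cases are illustrative assumptions rather than details from the original analysis.

# Sketch of the replication-sample assignment: train a 1-vs-all and a 1-vs-1
# linear SVM on the original sample and assign a replication case only when
# both models agree; otherwise leave it unclassified.
import numpy as np
from sklearn.svm import LinearSVC, SVC
from sklearn.multiclass import OneVsRestClassifier

def assign_replication(X_a, y_a, X_b):
    ovr = OneVsRestClassifier(LinearSVC()).fit(X_a, y_a)   # 1-vs-all
    ovo = SVC(kernel="linear").fit(X_a, y_a)                # SVC is 1-vs-1 internally
    pred_ovr, pred_ovo = ovr.predict(X_b), ovo.predict(X_b)
    agree = pred_ovr == pred_ovo
    labels = np.where(agree, pred_ovr, -1)                  # -1 marks unclassified cases
    return labels, agree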
The replication sample (B) of 1,453 youth was assessed using identical instruments.
This sample consisted of successive admissions to juvenile assessment centers at four
urban judicial districts in a western state that was not included in the training sample.
The sample is 67% male delinquents. The average age is 15.6 years (SD = 1.6) and
ranges from 9.0 years to 18.0 years. The ethnicity breakdown in the sample is 54.2%
Caucasian, 17.5% African American, 23.9% Latino/a, and 4.4% other ethnic groups.
Approximately 70% of these youth had entered the juvenile assessment centers after an
arrest for a misdemeanor or felony offense, while the remainder were brought in for
other reasons, including status offenses, school referrals, and family issues. Fifty-five
percent of the sample had no adjudications. Sample B contains fewer serious delinquents
than the original sample A.
The overall comparison of the initial cluster solution (B2) and replication solu-
tion (B1) produced a strong and significant global relation (Contingency Coefficient =
0.84, p < .0001). Yet, some differences did exist. The specific matching was not always
exact and the initial pattern 6 completely failed to replicate. The absence of cluster 6
prevented the computation of a kappa coefficient due to different numbers of classes in
B1 and B2. The missing cluster is understandable because the most serious delinquents
would be unlikely to be referred to the juvenile assessment centers.
The SVM classification indicated that overall 70% of the replication sample was
assigned to one of the original seven patterns. The most frequent cluster for both boys and
girls was the relatively low-risk/low-delinquency cluster 4. Cluster 6 (the most serious
delinquent profile) was the least frequent, with only 1.6% for boys and 1.9% for girls.
Additionally, all seven of the original clusters were recovered by the SVM; although,
as noted, very few cases in the replication sample matched cluster 6. Overall, 30% of


the replication sample failed to meet the matching criteria of the SVM and remained
unclassified.

2.4. Limitations

The present research has limitations and could be extended in several directions. Our
sample, while large and fairly heterogeneous, was limited to two state jurisdictions (Geor-
gia and North Dakota) and one county jurisdiction (Ventura, California). Additionally,
our sample did not cover the entire spectrum of juvenile justice agencies but was domi-
nated by committed youth. These sample characteristics limit the generalizability of our
findings.
The selection of taxonomic methods is also a difficult issue and several alternative
approaches are possible. We utilized only a limited set of potentially appropriate pattern
seeking methods. Nagin and Paternoster [28] acknowledge there is no clear consensus
on the most appropriate methods to study population heterogeneity and suggest that re-
searchers should explore different methods with different assumptions. Alternatives in-
clude several families of cluster analysis, latent class models, Meehl’s [29] taxometric
methods and semi-parametric mixed Poisson models [30,13,28]. We adopted this sug-
gestion by using several classical density-seeking methods and a more recent method
embodying different mathematical assumptions to identify pattern structure.
Another methodological limitation is the unresolved challenge of finding an optimal
value of K. [31] list over 30 different approaches to this problem. Ultimately, as in many
recent studies [32], we relied on a combination of methods, as well as interpretative
clarity. The K = 7 solution is tentative, and we acknowledge that the more parsimonious
K = 5 solution (not discussed here) may have advantages.
A perennial difficulty in any taxonomic study is the selection (coverage) and focus
of classification factors or classification space. Any specific selection inevitably imposes
a limitation on the knowledge claims and inferences that can be made regarding the re-
sulting types - and will inevitably omit other explanatory perspectives. In contrast several
prior studies adopted a broad holistic person-centered strategy, recommended by Mag-
nusson [33], by using comprehensive multivariate coverage of key factors. Our present
approach to selecting features was guided by several current theories of delinquency and
the extant taxonomic literature, and included a spectrum of family, peer, school, com-
munity, cognition and personality domains. A key omission may be our limited cov-
erage of mental health factors. The distinctions between our two internalizing clusters
could perhaps gain from a deeper assessment of mental health issues. A current study
(in progress) has added mental health assessment to the current measurement space. Pre-
liminary results show that depression and suicide risk, as expected, correlate highly with
social isolation and negative social cognition scales.
A related methodological challenge is that irrelevant or poorly discriminating vari-
ables inevitably add noise and may blur boundaries between clusters [13]. This issue
was recently reviewed and new methodological approaches to its resolution were offered
[34]. In ongoing research we are exploring these and other alternatives to the possible
refinement of this classification space.
In conclusion, we agree with both Nagin and Paternoster [28] and Lykken [2] that
we are still at early stages in mapping the taxonomic heterogeneity of delinquency – from
both behavioral and explanatory perspectives. Although this study has produced several


replications and extensions of the prior taxonomic research in delinquency, it has also
revealed some of the complexities of both the vertical and horizontal structures regarding
delinquency.

3. Conclusion

In this paper we report on several difficult issues in finding clusters in a large sample
of delinquent youth using the Youth COMPAS assessment instrument. This instrument
contains 32 social and psychological scales that are widely used in assessing criminal
and delinquent populations.
Cluster analysis methods (Ward’s method, standard k-means, bagged k-means and
a semi-supervised pattern learning technique) were applied to the data. Cross-method
verification and external validity were examined. Core or exemplar cases were identified
by means of a voting (consensus) procedure. Seven recurrent clusters emerged across
replications.
The clusters identified were Internalizing Youth A [2,19,20], Socialized Delinquents [21,22,23], Versatile Offenders [2], Normal Accidental Delinquents [1], Internalizing Youth B [2], Low-control Versatile Offenders [2,3] and Normative Delinquency [24].
Each of these clusters was found to relate fairly clearly to types previously identified in
various studies in the criminology literature, but had never been identified at the same
time in one data set using clustering methods. Additionally, the present analysis provides
a more complete set of empirical descriptions for these recurring types than offered in any
previous studies. This is the first study in which most of the well replicated patterns were
identified purely from the data using unsupervised learning and clustering methods. Most
prior studies provide only partial theoretical or clinical descriptions, omit operational
type-identification procedures and offer only a limited coverage of the critical features.
In this project we introduced a novel way of hybrid-case elimination in an unsuper-
vised setting. Although we are still working on establishing a stronger theoretical foundation for this approach, it has given results that are readily recognized and interpreted by delinquency counselors in applied juvenile justice settings. Following the establishment
of these clusters a classifier was developed from the data to efficiently classify new cases.
A further methodological lesson was that the initial solution, obtained using an elaborate outlier removal process with Ward's linkage and regular K-Means, was easily replicated using Bagged K-Means without outlier removal or other "manual" operations. The
present project has suggested that Bagged K-Means appears to be very robust against
noise and outliers.

References

[1] T. E. Moffitt. Adolescence-limited and life-course persistent antisocial behavior: A developmental tax-
onomy. Psychological Review, 100(4):674–701, 1993.
[2] D. Lykken. The Antisocial Personalities. Lawrence Erlbaum, Hillsdale, N.J., 1995.
[3] L. Mealey. The sociobiology of sociopathy: An integrated evolutionary model. Behavioral and Brain
Sciences, 18(3):523–599, 1995.
[4] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, and B. Schölkopf. Learning with local and global con-
sistency. In L. Saul S. Thrun and B. Schölkopf, editors, Advances in Neural Information Processing
Systems 16, Cambridge, Mass., 2004. MIT Press.


[5] D.P. Farrington. Integrated Developmental and Life-Course Theories of Offending. Transaction Pub-
lishers, London, 2005.
[6] A. R. Piquero and T.E. Moffitt. Integrated Developmental and Life-Course Theories of Offending,
chapter Explaining the facts of crime: How developmental taxonomy replies to Farrington’s Invitation.
Transaction Publishers, London, 2005.
[7] D.W. Osgood. Making sense of crime and the life course. Annals of AAPSS, 602:196–211, 2005.
[8] A.P. Dempster, N.M. Laird, and D. Rubin. Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, 39, 1977.
[9] Evgenia Dimitriadou, Andreas Weingessel, and Kurt Hornik. Voting-merging: An ensemble method for
clustering. In Lecture Notes in Computer Science, volume 2130, page 217. Springer Verlag, Jan 2001.
[10] Alexander P. Topchy, Anil K. Jain, and William F. Punch. Combining multiple weak clusterings. In
Proceedings of the ICDM, pages 331–338, 2003.
[11] Cheng-Ru Lin and Ming-Syan Chen. Combining partitional and hierarchical algorithms for robust and
efficient data clustering with cohesion self-merging. In IEEE Transactions on Knowledge and Data
Engineering, volume 17, pages 145 – 159, 2005.
[12] T. Brennan, M. Breitenbach, and W. Dieterich. Towards an explanatory taxonomy of adolescent
delinquents: Identifying several social-psychological profiles. Journal of Quantitative Criminology,
24(2):179–203, 2008.
[13] G. W. Milligan. Clustering and Classification, chapter Clustering validation: Results and implications
for applied analyses., pages 345–379. World Scientific Press, River Edge, NJ, 1996.
[14] J. Han and M. Kamber. Data Mining - Concepts and Techniques. Morgan Kauffman, San Francisco,
2000.
[15] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[16] S. Dolnicar and F. Leisch. Getting more out of binary data: Segmenting markets by bagged clustering.
Working Paper 71, SFB “Adaptive Information Systems and Modeling in Economics and Management Science”, August 2000.
[17] R Development Core Team. R: A language and environment for statistical computing. R Foundation for
Statistical Computing, Vienna, Austria, 2004. ISBN 3-900051-07-0.
[18] Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is “nearest neighbor” meaningful? Lecture Notes in Computer Science, 1540:217–235, 1999.
[19] M. Miller, D. Kaloupek, A. Dillon, and T. Keane. Externalizing and internalizing subtypes of combat-
related PTSD: A replication and extension using the PSY-5 scales. Journal of Abnormal Psychology,
113(4):636–645, 2004.
[20] A. Raine, T. E. Moffitt, and A. Caspi. Neurocognitive impairments in boys on the life-course persistent
antisocial path. Journal of Abnormal Psychology, 114(1):38–49, 2005.
[21] W. Miller. Lower-class culture as a generating milieu of gang delinquency. Journal of Social Issues,
14:5–19, 1958.
[22] C. F. Jesness. The Jesness Inventory Classification System. Criminal Justice and Behavior, 15(1):78–91,
1988.
[23] M. Q. Warren. Classification of offenders as an aid to efficient management and effective treatment.
Journal of Criminal Law, Criminology, and Police Science, 62:239–258, 1971.
[24] T. E. Moffitt, A. Caspi, M. Rutter, and P. A. Silva. Sex Differences in Antisocial Behaviour. Cambridge University Press, Cambridge, Mass., 2001.
[25] R. McIntyre and R. Blashfield. A nearest-centroid technique for evaluating the minimum-variance clus-
tering procedure. Multivariate Behavioral Research, 15(2):225–238, 1980.
[26] A. Gordon. Classification. Chapman and Hall, New York, 1999.
[27] V. Vapnik. The Nature of Statistical Learning Theory. Wiley, NY, 1998.
[28] D. Nagin and Raymond Paternoster. Population heterogeneity and state dependence: State of the evi-
dence and directions for future research. Journal Of Quantitative Criminology, 16(2):117–144, 2000.
[29] P. Meehl and L.J. Yonce. Taxometric analysis. I: Detecting taxonicity with two quantitative indicators using means above and below a sliding cut (MAMBAC procedure). Psychological Reports, 74, 1994.
[30] T. Brennan. Classification: An overview of selected methodological issues. In D. M. Gottfredson and
M. Tonry, editors, Prediction and Classification: Criminal Justice Decision Making, pages 201–248.
University of Chicago Press, Chicago, 1987.
[31] G. W. Milligan and M. C. Cooper. An examination of procedures for determining the number of clusters
in a data set. Psychometrika, 50:159–79, 1985.


[32] P. T Costa, J.H Herbst, R.R. McCrae, J. Samuels, and D. J. Ozer. The replicability and utility of three
personality types. European Journal of Personality, 16:73–87, 2002.
[33] D. Magnusson. The individual as an organizing principle in psychological inquiry: A holistic approach.
In L-G Nillson Lars R. Bergman, R.B Cairns and L. Nystedt, editors, Developmental Science and the
Holistic Approach, pages 33–47. Lawrence Erlbaum, Mahwah: New Jersey, 2000.
[34] A.E. Raftery and Nema Dean. Variable selection for model-based clustering. Technical Report 452,
Department of Statistics, University of Washington, May 2004.
Data Mining for Business Applications 137
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-137

Forecasting Online Auctions using Dynamic Models
Wolfgang JANK 1 and Galit SHMUELI
Department of Decisions, Operations and Information Technologies
The Robert H. Smith School of Business
University of Maryland

Abstract. We propose a dynamic forecasting model for price in online auctions. One of the key features of our model is that it operates during the live-auction, gen-
erating real-time forecasts which makes it different from previous static models.
Our model is also different with respect to how information about price is incor-
porated. While one part of the model is based on the more traditional notion of an
auction’s price-level, another part incorporates its dynamics in the form of price-
velocity and -acceleration. In that sense, it incorporates key features of a dynamic
environment such as an online auction. The use of novel functional data methodol-
ogy allows us to measure, and subsequently include, dynamic price characteristics.
We illustrate our model on a diverse set of eBay auctions across many different
book categories. It achieves significantly higher prediction accuracy compared to
standard approaches.

Keywords. Functional data analysis, forecasting, dynamics, online auctions, eBay

Introduction
eBay (www.eBay.com) is the world's largest Consumer-to-Consumer (C2C) online auction house. There are approximately 44 million items available worldwide on eBay
at any given time and approximately 4 million new items are added every day in over
50,000 categories. On eBay.com, an identical (or near-identical) product is often sold
in numerous, often simultaneous auctions. For instance, a simple search under the key
words “iPod shuffle 512MB MP3 player" returns over 300 hits for auctions that close
within the next 7 days. A more general search under the less restrictive key words “iPod
MP3 player" returns over 3,000 hits. Clearly, it would be challenging, even for a very
dedicated eBay user, to inspect and simultaneously monitor all of these 300 (or 3,000)
auctions, while being on the look-out for newly added auctions for the same product, and
subsequently deciding in which of these numerous auctions to participate and to place a
bid.
The decision making process of an eBay bidder can be supported by price forecasts.
With the availability of a price forecasting system, one can create an auction-ranking
(from lowest predicted price to highest) and select those auctions with the lowest pre-
1 Corresponding author: Department of Decisions, Operations and Information Technologies, The Robert H.

Smith School of Business, University of Maryland, College Park, MD 20742; E-mail: [email protected]

dicted price. One of the difficulties with such an approach is that information in the on-
line environment changes constantly: new auctions enter the market, old (i.e. closed)
auctions drop out, and even within the same auction the price changes continuously with
every new incoming bid. Thus, a well-functioning forecasting system must be adaptive
to accommodate a constantly changing environment.
We propose a dynamic forecasting model that can adapt to change. In general, price
forecasts can be done in two different ways, in a static or in a dynamic way. The static
approach relates information that is known before the start of the auction to information
that becomes available after the auction closes. This is the basic principle of several
existing models [1,2,3,4]. For instance, one could relate the opening bid, the auction
length and a seller’s reputation to the final price. Notice that opening bid, auction length,
and seller reputation are all known at the auction start. Training a model on a suitable set
of past auctions, one can obtain static forecasts of the final price in that fashion. However,
this approach does not take into account important information that arrives during the
auction. The current number of competing bidders or the current price level are factors
that are only revealed during the ongoing auction and that are important in determining
the future price. Moreover, the current change in price also has a huge impact on the
future price. If, for instance, the price had increased at an extremely fast rate over the
last several hours, causing bidders to drop out of the bidding process or to revise their
bidding strategies, then this could have an immense impact on the evolution of price in
the next few hours and, subsequently, on the final price. We refer to models that account
for newly arriving information and for the rate at which this information changes as
dynamic models.
Dynamic price forecasting in online auctions is challenging for a variety of reasons.
Traditional methods for forecasting time-series, such as exponential smoothing or mov-
ing averages, cannot be applied in the auction context, at least not directly, due to the
special data structure. Traditional forecasting methods assume that data arrive in evenly-
spaced time intervals such as every quarter or every month. In such a setting, one trains
the model on data up to the current time period t, and then uses this model to predict
at time t + 1. Implied in this process is the assumption that the distance between two
adjacent time periods is equal, which is the case for quarterly or monthly data. Now
consider the case of online auctions. Bids arrive in very unevenly-spaced time intervals,
determined by the bidders and their bidding strategies, and the number of bids within a
short period of time can sometimes be very sparse, while at other times extremely dense. In
this setting, the distance between t and t + 1 can sometimes be more than a day, while at
other times it may only be a few seconds. Traditional forecasting methods also assume
that the time-series continues, at least in theory, for an infinite amount of time and does
not stop at any point in the near future. This is clearly not the case in a 5- or 7-day online
auction. The implication of this is a discrepancy in the estimated forecasting uncertainty.
And lastly, online auctions, even for the same product, can experience price paths with
very heterogeneous price dynamics [5,6]. By price dynamics we mean the speed at which
price travels during the auction and the rate at which this speed changes. Traditional
models do not account for instantaneous change and its effect on the price forecast. This
calls for new methods that can measure and incorporate this important information.
In this work we propose a new approach for forecasting price in online auctions.
The approach allows for dynamic forecasts in that it incorporates information from the
ongoing auction. It accommodates the unevenly spacing of data, and also incorporates


change in the price dynamics. Our forecasting approach is housed within the principles
of functional data analysis [7]. In Section 1 we explain the principles of functional data
analysis and derive our functional forecasting model in Section 2. We apply our model
to a set of bidding data for a variety of book auctions in Section 3. We conclude with
further remarks in Section 4.

1. Functional Data Models

The technological advancements in measurement, collection, and storage of data have led to more complex data-structures. Examples include measurements of individuals'
behavior over time, digitized 2- or 3-dimensional images of the brain, and recordings
of 3- or even 4-dimensional movements of objects travelling through space and time.
Such data, although recorded in discrete fashion, can be thought of as continuous objects
represented by functional relationships. This gives rise to the field of functional data
analysis (FDA) where the center of interest is a set of curves, shapes, objects, or, more
generally, a set of functional observations. This is in contrast to classical statistics where
the interest centers around a set of data vectors. In that sense, functional data are not only
different from the data-structure studied in classical statistics, but actually generalize it.
Many of these new data-structures call for new statistical methods in order to unveil the
information that they carry.

1.1. The Price Curve and its Dynamics

A functional data set consists of a collection of continuous functional objects such as the
price paths in an online auction. Despite their continuous nature, limitations in human
perception and measurement capabilities allow us to observe these curves only at discrete
time points. Thus, the first step in a typical functional data analysis is to recover (or
estimate), from the observed data, the underlying continuous functional object [7]. This
is usually done with the help of data smoothing.
A variety of different smoothing methods exist. One very flexible and computationally efficient choice is the penalized smoothing spline [8]. Let $\tau_1, \ldots, \tau_L$ be a set of knots.
Then, a polynomial spline of order $p$ is given by

$$ f(t) = \beta_0 + \beta_1 t + \cdots + \beta_p t^p + \sum_{l=1}^{L} \beta_{pl} (t - \tau_l)_+^p , \qquad (1) $$

where $u_+ = u\, I_{[u \ge 0]}$ denotes the positive part of the function $u$. Define the roughness penalty

$$ \mathrm{PEN}_m(t) = \int \{D^m f(t)\}^2 \, dt, \qquad (2) $$

where $D^m f$, $m = 1, 2, 3, \ldots$, denotes the $m$th derivative of the function $f$. The penalized smoothing spline $f$ minimizes the penalized squared error

$$ \mathrm{PENSS}_{\lambda,m} = \int \{y(t) - f(t)\}^2 \, dt + \lambda\, \mathrm{PEN}_m(t), \qquad (3) $$


where y(t) denotes the observed data at time t and the smoothing parameter λ controls
the trade-off between data-fit and smoothness of the function f . Using m = 2 in (3)
leads to the commonly encountered cubic smoothing spline. Other possible smoothers
include the use of B-splines or radial basis functions [8].
The choice of the knots influences the resulting smoothing spline. Our goal is to
obtain smoothing splines that represent, as much as possible, the price formation process.
To that end, our selection of knots mirrors the distribution of bid arrivals [9]. We also
choose the smoothing parameter λ to balance data-fit and smoothness [10].
The process of going from observed data to functional data is now as follows. For a set of $n$ functional objects, let $t_{ij}$ denote the time of the $j$th observation ($1 \le j \le n_i$) on the $i$th object ($1 \le i \le n$), and let $y_{ij} = y(t_{ij})$ denote the corresponding measurements. Let $f_i(t)$ denote the penalized smoothing spline fitted to the observations $y_{i1}, \ldots, y_{i n_i}$. Then, functional data analysis is performed on the continuous curves $f_i(t)$ rather than on the noisy observations $y_{i1}, \ldots, y_{i n_i}$. For ease of notation we will suppress the subscript $i$ and write $y_t = f(t)$ for the functional object and $D^{(m)} y_t = f^{(m)}(t)$ for its $m$th derivative.
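As an illustration of this step, the sketch below recovers a smooth log-price curve and its first two derivatives from one auction's bid history. It is written in Python; scipy's UnivariateSpline is used as a convenient stand-in for the penalized smoothing spline of (1)-(3), and the smoothing factor, grid size and variable names are assumptions rather than the settings used in the paper.

# Sketch: recover the price curve y_t and its dynamics D^(1) y_t, D^(2) y_t
# from one auction's bids. Assumes bid_times are sorted, strictly increasing,
# and contain at least six observations (required for a degree-5 spline).
import numpy as np
from scipy.interpolate import UnivariateSpline

def price_dynamics(bid_times, log_prices, smooth=1.0, grid=None):
    """bid_times, log_prices: 1-D numpy arrays for a single auction.
    Returns a time grid with the fitted curve, its velocity and its acceleration."""
    f = UnivariateSpline(bid_times, log_prices, k=5, s=smooth)  # degree-5 so the 2nd derivative is smooth
    if grid is None:
        grid = np.linspace(bid_times.min(), bid_times.max(), 200)
    return grid, f(grid), f.derivative(1)(grid), f.derivative(2)(grid)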
Consider Figure 1 for illustration. The circles in the top panel of Figure 1 correspond
to a scatterplot of the bids (on log-scale) versus their timing. The continuous curve in
the top panel shows a smoothing spline of order m = 4 using a smoothing parameter
λ = 50.
One of our modeling goals is to capture the dynamics of an auction. While yt de-
scribes the magnitude of the current price, it does not reveal the dynamics of how fast
the price is changing or moving. Attributes that we typically associate with a moving
object are its velocity (or its speed) as well as its acceleration. Note that we can compute
the price velocity and price acceleration via the first and second derivatives, D(1) yt and
D(2) yt , respectively.
Consider again Figure 1. The middle panel corresponds to the price velocity, D(1) yt .
Similarly, the bottom panel shows the price acceleration, D(2) yt . The price velocity has
several interesting features. It starts out at a relatively high mark which is due to the
starting price that the first bid has to overcome. After the initial high speed, the price
increase slows down over the next several days, reaching a value close to zero mid-
way through the auction. A close-to-zero price velocity means that the price increase is
extremely slow. In fact, there are no bids between the beginning of day 2 and the end of
day 4 and the price velocity reflects that. This is in stark contrast to the price increase on
the last day where the price velocity picks up pace and the price jumps up!
The bottom panel in Figure 1 represents price acceleration. Acceleration is an im-
portant indicator of dynamics since a change in velocity is preceded by a change in accel-
eration. In other words, a positive acceleration today will result in an increase of velocity
tomorrow. Conversely, a decrease in velocity must be preceded by a negative acceler-
ation (or deceleration). The bottom panel in Figure 1 shows that the price acceleration
is increasing over the entire auction duration. This implies that the auction is constantly
experiencing forces that change its price velocity. The price acceleration is flat during the
middle of the auction where no bids are placed. With every new bid, the auction experi-
ences new forces. The magnitude of the force depends on the size of the price-increment.
Smaller price-increments will result in a smaller force. On the other hand, a large number
of small consecutive price-increments will result in a large force. For instance, the last 2
bids in Figure 1 arrive during the final moments of the auction. Since the increments are


Current Price

3.7
Log−Price
3.5
3.3

0 1 2 3 4 5 6 7
Day of Auction

Price Velocity
First Derivative of Log−Price
0.00 0.05 0.10 0.15

0 1 2 3 4 5 6 7
Day of Auction

Price Acceleration
Second Derivative of Log−Price
0.00 0.04
−0.06

0 1 2 3 4 5 6 7
Day of Auction

Figure 1. Current price, price velocity (first derivative) and price acceleration (second derivative) for a selected
auction. The first graph shows the actual bids together with the fitted curve.

relatively small, the price acceleration is only moderate. A more systematic investigation
Copyright © 2010. IOS Press, Incorporated. All rights reserved.

of auction dynamics has been done in other places [5,6].

2. Dynamic Forecasting Model

As pointed out earlier, the goal is to develop a dynamic forecasting model. By dynamic
we mean a model that operates in the live-auction and forecasts price at a future time
point of the ongoing auction. This is in contrast to a static forecasting model which makes
predictions only about the final price, and which takes into consideration only information
available before the start of the auction. Consider Figure 2 for illustration. Assume that
we observe the price path from the start of the auction until time t (solid black line).
We now want to forecast the continuation of this price path (broken grey lines, labelled
"A", "B", and "C"). The difficulty in producing this forecast is the uncertainty about the
price dynamics in the future. If the dynamics level-off, then the price increase will slow
down and we might see a price path similar to A. If the dynamics remain steady, the price
path might look more like the one in B. Or, if the dynamics sharply increase, then a path like the one in C could be the consequence. Either way, knowledge of the future price dynamics appears to be a key factor!

Figure 2. Schematic of the dynamic forecasting model of an ongoing auction: the price path observed up to time t (solid line) and possible predicted price paths A, B and C over the remainder of the auction (price in dollars versus day of auction).
Our dynamic forecasting model consequently consists of two parts: First, we develop
a model for price dynamics. Then, using estimated dynamics, together with other relevant
covariates, we derive an econometric model of the final price and use it to forecast the
outcome of an auction.

2.1. Modeling and Forecasting Dynamics

We pointed out earlier that one of the main characteristics of online auctions is their
rapid change in dynamics. Since change in the p + 1st derivative precedes change in the
pth derivative (e.g. change in acceleration precedes change in velocity), we make use of
derivative information for forecasting. In the following, we develop a model to estimate
and forecast an auction’s price dynamics.


Let D(m) yt again denote the mth derivative of the price yt at time t. We model the
derivative curve as a polynomial in time t with autoregressive (AR) residuals,

D(m) yt = a0 + a1 t + · · · + ak tk + αx(t) + ut , (4)

where x(t) is a vector of covariates, α is a corresponding vector of parameters, and ut


follows an autoregressive model of order p:

ut = φ1 ut−1 + φ2 ut−2 + · · · + φp ut−p + εt , (5)


εt ∼ N (0, σ 2 ).
We allow (4) to depend on the vector x(t), which results in a very flexible model that can
accommodate different dynamics due to differences in the auction format or the product
category.
We estimate model (4) from the training sample. Estimation is done in two steps:
First, we estimate the parameters a1 , . . . , ak and α. Then, using the estimated residuals
ût , we estimate φ1 , . . . , φp .
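A minimal sketch of this two-step estimation could look as follows; it uses ordinary least squares in both steps, which mirrors the logic (though not necessarily the exact estimator) of the procedure above, and all variable names are illustrative stand-ins.

import numpy as np

def fit_dynamics_model(t, dmy, X, k=2, p=1):
    """Two-step fit of the dynamics model (4)-(5).

    t : (T,) observation times; dmy : (T,) observed D(m) y_t (e.g. price velocity);
    X : (T, q) covariate matrix x(t); k : polynomial order; p : AR order.
    """
    # Step 1: a_0 + a_1 t + ... + a_k t^k + alpha' x(t), fitted by least squares.
    design = np.column_stack([t ** j for j in range(k + 1)] + [X])
    coef, *_ = np.linalg.lstsq(design, dmy, rcond=None)
    resid = dmy - design @ coef                      # estimated residuals u_t

    # Step 2: AR(p) coefficients phi_1, ..., phi_p fitted to the residuals.
    lags = np.column_stack([resid[p - j:len(resid) - j] for j in range(1, p + 1)])
    phi, *_ = np.linalg.lstsq(lags, resid[p:], rcond=None)
    return coef, phi, resid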

Forecasting is also done in two steps. Let 1 ≤ t ≤ T denote the observed time
period and let T + 1, T + 2, T + 3, . . . denote time periods we wish to forecast. We first
forecast the next residual via

ũT +1|T = φ˜1 uT + φ˜2 uT −1 + · · · + φ˜p uT −p+1 . (6)

Using this forecast, we can predict the derivative at the next time point T + 1 via

D(m) ỹT +1|T = (7)


â0 + â1 (T + 1) + · · · + âk (T + 1)k + α̂x(T + 1) + ũT +1|T .

In a similar fashion, we can predict the derivative l steps ahead:

D(m) ỹT +l|T = (8)


â0 + â1 (T + l) + · · · + âk (T + l)k + α̂x(T + l) + ũT +l|T

2.2. Modeling and Forecasting Price

After forecasting the price dynamics, we use these forecasts to predict the auction
price over the next time periods up to the auction end. Many factors can affect the
price in an auction such as information about the auction format, the product, the
bidders and the seller. Let x(t) denote the vector of all such factors. Let d(t) =
(D(1) yt , D(2) yt , . . . , D(p) yt ) denote the vector of price dynamics, i.e. the vector of the
first p derivatives of y at time t. The price at t can be affected by the price at t − 1 and
potentially also by its values at times t − 2, t − 3, etc. Let l(t) = (yt−1 , yt−2 , . . . , yt−q )
denote the vector of the first q lags of yt . We then write the general dynamic forecasting
model as follows:

yt = βx(t) + γd(t) + δl(t) + εt (9)

where β, γ and δ denote the parameter vectors and εt ∼ N (0, σ 2 ). We use the estimated
model (9) to predict the price at T + l as

ỹT +l|T = β̂x(T + l) + γ̂d(T + l) + δ̂l(T + l) (10)
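Putting (6)–(10) together, forecasting can be sketched as a simple recursion: at each future time point we forecast the AR residual, then the velocity, and then the price, feeding the predicted price back in as the lag for the next step. The function below is a schematic under simplifying assumptions (a single velocity term, one price lag, AR(1) residuals, and already-estimated coefficients with hypothetical names); it shows the flow of the computation rather than the authors' exact implementation.

import numpy as np

def dynamic_forecast(T, steps, last_price, last_resid,
                     poly_coef, alpha, phi1, beta, gamma, delta,
                     x_dyn_future, x_price_future):
    """Forecast price at T+1, ..., T+steps via (6)-(10).

    poly_coef : (a_0, ..., a_k) for the velocity polynomial in (4)
    alpha, phi1 : covariate and AR(1) coefficients from (4)-(5)
    beta, gamma, delta : price-model coefficients from (9)
    x_dyn_future, x_price_future : covariate vectors x(T+l), l = 1..steps
    """
    prices, resid, price_lag = [], last_resid, last_price
    for l in range(1, steps + 1):
        t = T + l
        resid = phi1 * resid                                      # (6): residual forecast
        velocity = (sum(a * t ** j for j, a in enumerate(poly_coef))
                    + float(alpha @ x_dyn_future[l - 1]) + resid) # (7)/(8): velocity forecast
        price = (float(beta @ x_price_future[l - 1])
                 + gamma * velocity + delta * price_lag)          # (10): price forecast
        prices.append(price)
        price_lag = price            # the forecast becomes the lag for the next step
    return np.array(prices)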

3. Empirical Results

3.1. Data

Our data set is diverse and contains 768 eBay book auctions from October 2004. All auc-
tions were 7 days long and span a variety of categories (see Table 1). Prices range from
$0.10 to $999 and are, not unexpectedly, highly skewed. Prices also vary significantly
across the different book categories. This data set is challenging due to its diversity in
products and price. We use 70% of these auctions (or 538 auctions) for training purposes.
The remaining 30% (or 230 auctions) are kept in the validation sample.


Table 1. Categories of 768 book auctions. The second column gives the number of auctions per category. The
third and fourth column show average and standard deviation of price per category.

Book Category Count Mean StDev


Antiquarian & Collect. 84 $89.90 $165.82
Audiobooks 46 $18.57 $22.48
Children’s Books 102 $12.89 $18.03
Fiction Books 162 $7.90 $9.34
Magazines 58 $11.87 $9.43
Nonfiction Books 239 $15.63 $52.83
Textbooks & Educ. 36 $19.62 $37.46
Other 41 $36.48 $80.31

3.2. Estimated Model

Our model-building investigations suggest that among all price dynamics only the velocity f′(t) is significant for forecasting price in our data. We thus estimate model D(m) yt in (4) only for m = 1. Specifically, using a quadratic polynomial (k = 2) in time t together with influence-weighted [10] predictor variables for book-category (x̃1 (t)) and shipping costs (x̃2 (t)) results in an AR(1) process for the residuals ut (i.e. p = 1 in (5)). The rationale behind using book-category and shipping costs in model
(4) is that we would expect the dynamics to depend heavily on these two variables. For
instance, the category of antiquarian and collectible books typically contains items that
are of rare nature and that appeal to a market that is not very price sensitive and with
a strong interest in obtaining the item. This is also reflected in the large average price
and even larger variability for items in this category (Table 1). The result of these market
differences may well be a different price evolution and thus different price dynamics. A
similar argument applies to shipping costs. Shipping costs are determined by the seller
and act as a "hidden" price premium. Bidders are often deterred by excessively high
shipping costs and as a consequence auctions may experience differences in the price
dynamics. Table 2 summarizes the estimated coefficients averaged across all auctions
from the training set. We can see that both book-category and shipping costs result in
significantly different price dynamics.

Table 2. Estimates for the velocity model D (1) yt in (4). The second column reports the estimated parameter
values and the third column reports the associated significance levels. Values are averaged across the training
set.

Predictor Coeff P-Val


Intercept 0.041 0.004
t -0.012 0.055
t2 0.004 0.041
Book Category x̃1 (t) 1.418 0.038
Shipping Costs x̃2 (t) 1.684 0.036
ut 1.442 -


After modeling the price dynamics we estimate the price forecasting model (9). Re-
call that (9) contains three model components, x(t), d(t) and l(t). Among all reasonable
price-lags only the first lag is influential, so we have l(t) = yt−1 . Also, as mentioned
earlier, among the different price dynamics we only find the velocity to be important,
so d(t) = D(1) yt . The first two rows of Table 3 display the corresponding estimated
coefficients.
Note that both l(t) and d(t) are predictor variables derived from price, either
from its lag or from its dynamics. We also use 8 non-price related predictor variables
x(t) = (x1 (t), x2 (t), x3 (t), x̃4 (t), x̃5 (t), x̃6 (t), x̃7 (t), x̃8 (t))T . Specifically, the 8 pre-
dictor variables correspond to the average rating of all bidders until time t (which we
refer to as the current average bidder rating at time t and denote as x1 (t)), the current
number of bids at time t (x2 (t)), and the current winner rating at time t (x3 (t)). These
first 3 predictor variables are time-varying. We also consider 5 time-constant predictors:
the opening bid (x̃4 (t)), the seller rating (x̃5 (t)), the seller’s positive ratings (x̃6 (t)), the
shipping costs (x̃7 (t)), and the book category (x̃8 (t)), where x̃i (t) again denotes the
influence-weighted variables.
Table 3 shows the estimated parameter values for the full forecasting model. It is
interesting to note that book-category and shipping costs have low statistical significance.
The reason for this is that their effects have likely already been captured satisfactorily in
the model for the price velocity. Also notice that the model is estimated on the log-scale
for better model fit. That is, the response yt and all numeric predictors (x̃1 (t), . . . , x̃7 (t))
are log-transformed. The implication of this lies in the interpretation of the coefficients.
For instance, the value 0.051 implies that for every 1% increase in opening bid, the price
increases by about 0.05%, on average.

Table 3. Estimates for the price forecasting model (9). The first column indicates the part of the model design
that the predictor is associated with. The third column reports the estimated parameter values and the fourth
column reports the associated significance levels. Values are again averaged across the training set.

Des Predictor Coeff P-Val


d(t) Price Velocity D (1) yt 0.592 0.049
l(t) Price Lag yt−1 4.824 0.044


x(t) Intercept 5.909 0.110
x(t) Cur.Avg.Bid.Rating x1 (t) 0.414 0.012
x(t) Cur.Numb.Bids x2 (t) -0.008 0.027
x(t) Cur.Win.Rating x3 (t) 0.197 0.027
x(t) Opening Bid x̃4 (t) 0.051 0.031
x(t) Seller Rating x̃5 (t) -11.534 0.070
x(t) Pos Seller Rating x̃6 (t) 1.518 0.093
x(t) Shipping Cost x̃7 (t) 0.008 0.215
x(t) Book Category x̃8 (t) 3.950 0.107

3.3. Forecasting Accuracy

We estimate the forecasting model on the training data and use the validation data to
investigate its forecasting accuracy. To that end we assume that for the 230 auctions in the validation data we only observe the price until day 6, and we want to forecast the remainder of the auction. We forecast price over the last day in small increments of 0.1 days. That is, from day 6 we forecast day 6.1, or the price after the first 2.4 hours of day 7. From day 6.1 we forecast day 6.2, and so on until the auction end at day 7. The advantage of a sliding-window approach is the possibility of feedback-based forecast improvements. That is, as the auction progresses over the last day, the true price level can be compared with its forecasted level and deviations can be channelled back into the model for real-time forecast adjustments.

Figure 3. Mean absolute percentage error (MAPE) of the forecasted price over the last auction-day. The solid line corresponds to our dynamic forecasting model; the dashed line corresponds to double exponential smoothing. The x-axis denotes the day of the auction (6.1 to 7).
Figure 3 shows the forecast accuracy on the validation sample. We measure fore-
casting accuracy using the mean absolute percentage error (MAPE), that is
MAPEt = (1/230) Σi |(Predicted Pricet,i − True Pricet,i) / True Pricet,i|,    t = 6.1, 6.2, . . . , 7,

where i denotes the ith auction in the validation data. The solid line in Figure 3 corre-
sponds to MAPE for our dynamic forecasting model. We benchmark the performance of
our method against double exponential smoothing. Double exponential smoothing is a
popular short term forecasting method which assigns exponentially decreasing weights
as the observation become less recent and also takes into account a possible (changing)
trend in the data. The dashed line in Figure 3 corresponds to MAPE for double exponen-
tial smoothing.
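For reference, the two ingredients of this comparison – the MAPE across the validation auctions at a given forecast time point, and a basic double exponential smoothing (Holt) forecaster of the kind used as a benchmark – can be sketched as follows. The smoothing constants and example data are illustrative choices, not those used in the chapter.

import numpy as np

def mape(predicted, true):
    """Mean absolute percentage error across auctions at one forecast time point."""
    predicted, true = np.asarray(predicted, float), np.asarray(true, float)
    return np.mean(np.abs((predicted - true) / true))

def double_exp_smoothing_forecast(y, alpha=0.5, beta=0.3, horizon=1):
    """Holt's double exponential smoothing: maintain a level and a trend, then extrapolate."""
    level, trend = y[0], y[1] - y[0]
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + horizon * trend

# e.g. (log-)price observed on a 0.1-day grid up to day 6, forecast one step ahead
observed = np.array([3.20, 3.30, 3.35, 3.40, 3.42, 3.45, 3.55])
print(double_exp_smoothing_forecast(observed, horizon=1))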

We notice that for both approaches, MAPE increases as we predict further into the
future. However, while for our dynamic model MAPE increases to only about 5% at the
auction-end, exponential smoothing incurs an error of over 40%. This difference in per-
formance is relatively surprising, especially given that exponential smoothing is a well-
established (and powerful) tool in time series analysis. One of the reasons for this under-
performance is the rapid change in price dynamics, especially at the auction-end. Expo-
nential smoothing, despite the ability to accommodate changing trends in the data, can-
not account for the price dynamics. This is in contrast to our dynamic forecasting model
which explicitly models price velocity. As pointed out earlier, a change in a function’s
velocity precedes a change in the function itself, so it seems only natural that modeling
the dynamics makes a difference for forecasting the final price.

4. Conclusions

In this paper we develop a dynamic price forecasting model that operates during the live
auction. Forecasting price in online auctions can have benefits to different auction par-
ties. For instance, price forecasts can be used to dynamically rank auctions for the same
(or similar) item by their predicted price. On any given day, there are several hundred,
or even thousands of open auctions available, especially for very popular items such as
Apple iPods or Microsoft Xboxes. Dynamic price ranking can lead to a ranking of auc-
tions with the lowest expected price which, subsequently, can help bidders make deci-
sions about which auctions to participate in. Auction forecasting can also be beneficial
to the seller or the auction house. For instance, the auction house can use price forecasts
to offer insurance to the seller. This is related to the idea by [2], who suggests offering
sellers an insurance that guarantees a minimum selling price. In order to do so, it is im-
portant to correctly forecast the price, at least on average. While Ghani’s method is static
in nature, our dynamic forecasting approach could potentially allow more flexible fea-
tures like an “Insure-It-Now” option, which would allow sellers to purchase an insur-
ance either at the beginning of the auction, or during the live auction (coupled with a
time-varying premium). Price forecasts can also be used by eBay-driven businesses that
provide brokerage services to buyers or sellers.
And a final comment: In order for dynamic forecasting to work in practice, it is im-
portant that the method is scalable and efficient. We want to point out that all components
of our model are based on linear operations: estimating the smoothing spline (Section 1) and fitting the AR model (Section 2.1) are both done in ways very similar to least squares.
In fact, the total runtime (estimation on training data plus validation on holdout data) for
our dataset (over 700 auctions) is less than a minute, using program code that is not (yet)
optimized for speed.

References

[1] Ghani, R. and Simmons, H. (2004). Predicting the end-price of online auctions. In the Proceedings
of the International Workshop on Data Mining and Adaptive Modelling Methods for Economics and
Management, Pisa, Italy, 2004.
[2] Ghani, R. (2005). Price prediction and insurance for online auctions. In the Proceedings of the 11th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, 2005.


[3] Lucking-Reiley, D., Bryan, D., Prasad, N., and Reeves, D. (2000). Pennies from ebay: the determinants
of price in online auctions. Technical report, University of Arizona.
[4] Bajari, P. and Hortacsu, A. (2003). The winner’s curse, reserve prices and endogenous entry: Empirical
insights from ebay auctions. Rand Journal of Economics, 3:2:329–355.
[5] Jank, W. and Shmueli, G. (2008). Studying Heterogeneity of Price Evolution in eBay Auctions via
Functional Clustering. Forthcoming at Adomavicius and Gupta (Eds.) Handbook of Information Systems
Series: Business Computing, Elsevier.
[6] Shmueli, G. and Jank, W. (2008). Modeling the Dynamics of Online Auctions: A Modern Statistical
Approach. Forthcoming at Kauffman and Tallon (Eds.) Economics, Information Systems & Ecommerce
Research II: Advanced Empirical Methods, M.E. Sharpe, Armonk, NY.
[7] Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis. Springer Series in Statistics.
Springer-Verlag New York, 2nd edition.
[8] Ruppert, D., Wand, M. P., and Carroll, R. J. (2003). Semiparametric Regression. Cambridge University
Press, Cambridge.
[9] Shmueli, G., Russo, R. P., and Jank, W. (2007). The Barista: A model for bid arrivals in online auctions.
The Annals of Applied Statistics, 1 (2), 412–441.
[10] Wang, S., Jank, W., and Shmueli, G. (2008). Forecasting ebay’s online auction prices using functional
data analysis. Forthcoming in The Journal of Business and Economic Statistics.
Data Mining for Business Applications 149
C. Soares and R. Ghani (Eds.)
IOS Press, 2010
© 2010 The authors and IOS Press. All rights reserved.
doi:10.3233/978-1-60750-633-1-149

A Technology Platform to Enable the Building of Corporate Radar Applications that Mine the Web for Business Insight
Peter Z. YEH and Alex KASS1
Accenture Technology Labs

Abstract. In this paper, we present a technology platform that can be customized to create a wide range of corporate radar applications that can turn the Web into a
systematic source of business insight. This platform integrates a combination of
established AI technologies – i.e. semantic models, natural language processing,
and inference engines – in a novel way. We present two prototype corporate radars
built using this platform: the Business Event Advisor, which detects threats and
opportunities relevant to a decision maker’s organization, and the Technology
Investment Radar which assesses the maturity of technologies that impact a
decision maker’s business. The Technology Investment Radar has been piloted
with business users, and we present encouraging initial results from this pilot.

Keywords. Business Applications, Performance Support, NLP, Inference, Semantic Models, Business Intelligence, Competitive and Market Intelligence.

Introduction

Many business decisions require a broad understanding of ways that various external
events might impact the business. For example, as managers formulate their company’s
annual marketing strategy, or product-development investments, they need to


understand issues like the following: are any events occurring that might indicate a
strategy shift by one of my competitors; what macro-economic events might change the
demand for my services; and which technologies might be maturing to the point where
they might change the competitive dynamics that my company operates in?
It is much easier for a manager to develop a useful perspective on these kinds of
forces if they can systematically and continuously monitor the external environment in
which their company operates for indications of a potential impact. Much raw
information that could enable such monitoring is out there somewhere on the net. The
Web has dramatically expanded the range and volume of external information that can
be used to inform business decisions. Information about contracts won, patents filed,
new products launched, successful technologies applied, job openings advertised, and
so forth are all being made available online every day. An episode recounted in a recent
Fortune Magazine story [17] about Microsoft and Google illustrates the insights that
can be generated from these kinds of information. The episode involved Bill Gates’ use
of the Web to enhance his insight into the serious nature of the threat that Google could

1 Corresponding author: 50 West San Fernando Street, San Jose, California 95113; E-mail: [email protected].
pose to Microsoft. While “poking around” on the Google corporate website, Gates
glanced at the listing of open positions, and he saw that Google was recruiting for all
kinds of expertise that had nothing to do with search. In fact, Gates noted, Google’s
recruiting goals seemed to mirror Microsoft’s! It was time to make defense against
Google a top priority for Microsoft.
This story shows the value of the Web as a source of business insight, but it also
illustrates how random and unsystematic the process of developing that Web-derived
insight can be. In practice, it is very difficult for any individual person (or even
reasonable-sized group of people) to scan the sheer volume of information, detect what
might be relevant, and do the necessary work to draw appropriate inferences and
connections to transform the raw information into useful business insights. This
process of scanning, detecting, and interpreting is not feasible to do manually at scale.
There are technologies and services, available today, that try to address this need,
but none of them offer a complete solution. Automated clipping services, for example,
can help filter the information stream, and thus represent a step in the right direction.
These services, however, do not help decision makers see the potential implications
that new pieces of information have on their organization’s specific concerns. Many
insights can only be generated by putting together several pieces of raw data from
disparate sources and by applying the relevant business knowledge to interpret them.
For example, only a system that models Microsoft’s current niche and product mix
would be able to detect the relevance of Google’s recruiting priorities. Without such a
model, the system cannot analyze the indirect relationships between events it detects
and the business objectives of the company it seeks to inform, leading it to either
ignore important events or cast its net too broadly. Hence, decision makers need more
than a filtered news source; they need tools that can directly draw connections between
data collected from the Web and the issues that matter to their business.
In contrast, enterprise Business Intelligence (BI) systems can help decision makers
see the implications of new information, but BI systems focus primarily on exploiting
the information that is flowing through a company’s own data systems to help
executives understand what is happening within their business operations. Since not
everything worth monitoring happens within the enterprise, executives need a
capability that can extend the limited, inward-looking scope of existing enterprise BI
systems to provide insight about external forces.


To fully address this problem of turning the Web into a source of business insight
will require a new class of applications that can monitor any company’s external
business environment in a way that is loosely analogous to how enterprise BI systems
monitor a company’s internal operations. This new class of applications – we call
corporate radars – must determine what external events are relevant to a company’s
specific business concerns, to detect these events by mining the Web, and to interpret
the implications of these events with respect to the company’s concerns. These
requirements can be satisfied through a novel integration of established AI technologies
like semantic models, natural language processing approaches, and inference engines.
In this paper, we discuss an ongoing effort to develop a technology platform that
can be customized to create a wide range of corporate radar applications that can
systematically turn the vast amount of information on the Web into business insights.
We present two prototype corporate radars built using our platform as part of this
effort:
• The Business Event Advisor, an early prototype, detects threats and
opportunities relevant to a decision maker’s organization. This corporate radar

continuously scans a variety of online sources to produce a dashboard that


reports the types of events detected, the entities involved in these events, and
the implications that can be inferred from these events.
• The Technology Investment Radar, a follow-on effort, assesses the maturity
of technologies that impact a decision maker’s business. This corporate radar
also scans a variety of online sources to produce a dashboard that reports a
maturity assessment for each technology of interest and the detected events
supporting that assessment.
The Technology Investment Radar has been piloted with business users, and we present
initial results from this pilot, which begin to demonstrate the value that these kinds of
systems can actually provide to real users.

1. Corporate Radar Platform

To enable applications that can systematically monitor the Web and turn it into a
source of business insight, we have been developing a technology platform upon which
a variety of corporate radar applications can be built (see Figure 1). This evolving
platform consists of three main components – semantic models, natural language
technologies (we call web sensors), and an inference engine – that interact with each
other to guide the detection of relevant signals from the Web, to produce from these
signals a stream of structured event descriptions, and to interpret the implications of
these events to generate actionable insights.

Figure 1. A schematic of our technology platform.

1.1. Semantic Models

The semantic models in our platform guide the detection and interpretation of relevant
events from the Web. There are three types of models in our platform – a model of the
business dynamic, a set of detection models, and a set of sensor models.
The business-dynamics model provides semantic representations of the ecosystem
in which a decision maker’s organization operates. To understand how this model is
used to drive processing in corporate radar applications, consider the following
example. Imagine that you manage a manufacturing company that attempts to use the
Web to discover actionable insights by running a system that monitors news stories and
price data. If your system can only notice price changes for your competitors’ products,

then the system would be of limited value because it might be too late to react once the
threat is that immediate. Moreover, something so directly relevant to the business will
most likely be noticed by company personnel (hence eliminating the need for an
automated solution). Now suppose instead that your system can notice price changes
for raw materials, rather than competing products, and moreover the raw materials are
not used in any of the products made by your company. If these raw materials are used
by your competitors, then a price change for any of these materials might have an
important, though indirect, impact on your business. Such price shifts happen all the
time, and humans trying to track and interpret all of these shifts may quickly become
overwhelmed. Suppose we have a system with a model describing who your
competitors are, which of their products compete with yours, and what raw materials
are used in each product. A relatively simple model like this, combined with basic
inferences about cost/price relationships, can enable a corporate radar to see that
although you do not use the raw material in question, a drop in the price of that
material may mean that your competitor can lower the price of its products, thereby
putting price pressure on you (see Figure 2).

Figure 2. A simple model of a competitive ecosystem and the insight that can be generated based on events
mined from the Web.

As the above example illustrates, the business-dynamics model includes semantic


representations of the entities that make up the competitive ecosystem in which a
particular company operates. These representations encode the entities’ attributes (e.g.
super-types, sub-types, etc.) and their relationships to each other (i.e. supplier,
competitor, etc.). The business-dynamics model also includes representations of events
that can impact the ecosystem. These representations encode the events’ type, their
participants, and their implications. We represent these entities and events as
conceptual graphs [15], but other formalisms such as RDF graphs [10] can be used as
well. Figure 3 shows semantic representations for a deploy event and a hypothetical
wireless technology vendor.
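As a rough illustration of what such representations can look like, the snippet below writes a deploy event and a hypothetical vendor as plain Python structures. The slot names and the Acme Wireless details are invented for illustration; they are not the Component Library or conceptual-graph encoding actually used by the platform.

# Hypothetical, simplified stand-ins for the graph-based semantic models.
deploy_event = {
    "type": "Deploy",
    "super_types": ["BusinessEvent"],
    "roles": {"agent": "Organization", "object": "Technology"},
    # Implication template applied by the inference engine once the event is detected.
    "implication": "companies supplied by the agent may improve products that use the object",
}

acme_wireless = {
    "type": "Organization",
    "name": "Acme Wireless",
    "super_types": ["TechnologyVendor"],
    "relationships": {
        "supplier_of": ["ExampleTelco"],            # who it supplies
        "competitor_of": ["Other Wireless Inc."],   # who it competes with
    },
}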


Figure 3. Left: The semantic model for a deploy event. Right: The semantic model for a hypothetical
wireless company – Acme Wireless.

To enable reuse and hence reduce the effort needed to customize the business-
dynamics model across different applications and domains, our platform uses an upper
ontology of generic concepts that can be extended to build domain specific ones. This
upper ontology – the Component Library [2] – provides a library of about 500 generic
events and entities that can be composed (and extended) to build business-dynamics
models for a wide range of corporate radar applications.
Detection models provide support for natural language processing which is needed
to detect and convert unstructured text on the Web into a structured event description –
i.e. a representation of the event type and the semantic roles of the entities participating
in the event. We use WordNet [6] and case theory [1] as detection models in our
technology platform. WordNet provides the lexical realizations for events and entities
in a business-dynamics model – i.e. these events and entities are annotated with the
corresponding senses from WordNet to indicate how they surface in language. For
example, a deploy event is annotated with the WordNet senses of deploy#2, launch#5,
etc.
Case theory provides the syntactic realizations for the semantic roles that an entity
can play in an event. For example, an entity performing an event plays the semantic
role of an agent and this role – according to case theory – can surface as a prepositional
phrase marked by the preposition “by” (e.g. […]) or as the subject (e.g. […]).


Sensor models provide semantic representations of the web sensors and enable a
loose coupling between the implementation of these sensors and the rest of our
technology platform. Each sensor model encodes the sensor’s type, the type of event
detected by this sensor, the confidence in the output, and the underlying
implementation to invoke. This abstraction allows the inference engine to determine
which sensors to invoke without regard for their implementation. Figure 4 shows the
model for a sensor to detect business acquisition events that a corporate radar
application may employ.

Figure 4. The semantic model for a sensor that detects business acquisition events.
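The role a sensor model plays can be sketched as follows: each sensor is described by the type of event it detects, a confidence value and a reference to its implementation, and a sensor is invoked whenever its event type subsumes the event the engine wants detected. The type hierarchy and the sensor entry below are invented placeholders that only illustrate the selection logic.

# Hypothetical event-type hierarchy: child type -> parent type.
PARENT = {"Acquisition": "Purchase", "Purchase": "BusinessEvent", "Deploy": "BusinessEvent"}

def subsumes(general, specific):
    """True if `general` equals `specific` or is one of its ancestors."""
    while specific is not None:
        if specific == general:
            return True
        specific = PARENT.get(specific)
    return False

sensor_models = [
    {"name": "acquisition_sensor", "detects": "Purchase", "confidence": 0.8,
     "implementation": lambda sources: []},   # stand-in for the actual sensor code
]

def sensors_for(event_type):
    """Select every sensor whose detected event type subsumes the requested event."""
    return [s for s in sensor_models if subsumes(s["detects"], event_type)]

print([s["name"] for s in sensors_for("Acquisition")])   # -> ['acquisition_sensor']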


1.2. Inference Engine

The inference engine plays three core roles in our technology platform. 1) It uses the
business-dynamics model to determine which events to detect from the Web. This task
is accomplished by retrieving the events encoded in this model. 2) The inference
engine uses sensor models to determine which web sensors to invoke for the events to
be detected. A sensor is invoked if the event type encoded in its corresponding model
subsumes the event to be detected. 3) The inference engine generates actionable
insights from detected events by applying implications encoded in the corresponding
event representations from the business-dynamics model. The inference engine, for
example, can apply the implication encoded in the representation of the deploy event
(see Figure 3, right, in Section 1.1) to a detected event about a competitor’s supplier
deploying a new product. The resulting inference will generate insight about the
competitor being able to improve its products that use the newly deployed one.
In addition to the above requirements, we also require the inference engine to
support (and reason over) expressive implications like the one in the above example –
i.e. implications that consider an entity’s role in a detected event and its relationship to
others within the overall ecosystem. This additional requirement improves the
relevancy of the insights generated. For example, if a supplier for your company
deploys a new product, then applying the same implication from before would generate a completely different insight – i.e. your company would be able to
improve its existing products instead of your competitor.
We use the Knowledge Machine (KM) [5] to provide the requirements described
above, but other implementations such as Pellet [14] or Jena [12] can be used as well.
KM is a frame-based inference engine grounded in first order predicate logic. It
provides a query language to retrieve information of interest about the models – e.g.
events encoded in the business-dynamics model. KM also supports subsumption
reasoning – used to determine the appropriate sensors to invoke – and reasoning with
implication rules – used to interpret the implications of detected events.
KM provides additional capabilities such as reasoning about quantitative
constraints – e.g. at least one commercial sale; more than 10 deployments; etc. – and a
situation calculus to reason about how changes in the world relate to existing
information – e.g. reports of orders for chipsets may imply a handset deployment
within 6 months. These additional capabilities allow our technology platform to
support (and reason over) expressive implications to improve the relevance of the
insights generated.
It is worth mentioning that most corporate radar applications focus primarily on
weak signals, resulting in inferences that may not be sound. Hence, the resulting
inferences should be viewed as suggestions that the implied events could be happening
(as opposed to deductive conclusions that they are), so it is up to the user to decide the
likelihood of the implication actually occurring. To help the user make this judgment,
the inference engine provides provenance information about which weak signals led to
a particular implication. A weak signal combined with many others that point to the
same implication will give the user more cause to believe in the likelihood of that
implication over an implication resulting from a lone signal without any corroboration.
We are currently exploring more sophisticated (and automatic) ways of weighing the
likelihood of an implication, but at this point we leave this task to the user.


1.3. Web Sensors

Web sensors detect relevant unstructured signals on the Web and produce from them
structured event descriptions that are consumed by the inference engine to generate
actionable insights. Decisions about what types of sensors to use and how they are
implemented depend on the specific corporate radar application. Some corporate
radars, like the Business Event Advisor described in Section 2.1, use a single sensor to
produce structured representations for all events that need to be detected. Other radars,
like the Technology Investment Radar described in Section 2.2, employ a collection of
specialized sensors – each targeting a specific event such as sales, deployments, etc.
Hence, our technology platform must support a variety of different web sensors.
Our platform satisfies this requirement through the sensor and detection models (see
Section 3.1). The sensor models provide an abstraction of the sensors which allows the
inference engine to determine which sensors to invoke without regard for their
implementation – e.g. the model for a business acquisition sensor (see Figure 4)
abstracts away the implementation of this sensor by encoding information such as the
type of the event detected and the confidence in the output that the inference engine can
reason over.
The detection models (see Section 1.1 also) provide linguistic support that specific
implementations may use to process unstructured text on the Web – e.g. many natural
language processing algorithms [11,13,16,18] produce structured representations from
text using lexical (and syntactic) knowledge which our platform provides.

2. Two Corporate Radar Applications

In this section, we describe the two corporate radar applications that we have built on
the technology platform described in the previous section.

2.1. Business Event Advisor

The Business Event Advisor [9] was a prototype built in 2006 as an early attempt to
develop the kind of corporate radar we discussed above. The objective is to help
executives identify external events that might constitute threats to – and opportunities
for – their business. For example, there would be important business value for
executives who could more consistently notice signs that a competitor might introduce
a new product to directly compete with one of their products, or that a supplier was at
risk of failing to deliver. The current implementation of the Business Event Advisor
detects a small set of event types, but the approach it employs is broadly applicable.
Everything from a competitor’s online job-recruiting advertisements to announcements
of deals made are types of information that – if systematically tracked and interpreted
in terms of the business dynamics that govern the executive’s business – can be used to
provide these early warning signals.
The system is designed to address these needs by detecting, organizing, and
interpreting a broad range of external business events in order to help business
decision-makers spot external threats and opportunities affecting their business. The
system achieves this capability using a model of the business dynamics that encodes
the entities and events that impact the ecosystem in which a particular company
operates. This model, for example, can encode entities like manufacturers, the products

they make, their suppliers, their customers, etc., and can encode events like executive
hirings, mergers and acquisitions, price changes, etc. The specific entities and events
encoded depend on the company that the model (and hence the application) is
customized for.
The Business Event Advisor uses this model to continuously scan a wide range of
news sources on the Web to generate an executive dashboard like the one detailed in
Figure 5. This dashboard makes it possible to see systematically the landscape of
relevant events – categorized by event type, participants in the event, estimated
importance, and the portion of the ecosystem impacted.
To detect and produce structured representations for events of interest from
unstructured text, the system employs a single web sensor built using an open-source
classification engine to determine the most likely event type combined with a
commercial natural language processing product to recognize the relevant entities. This
web sensor also used a library of syntactic patterns to determine the semantic roles
played by these entities. We refer the reader to [9] for a detailed discussion of how we
integrated these components.
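As a very rough sketch of what such a sensor does, the snippet below classifies a sentence into one of a few event types with a naive keyword score and assigns the agent role with a crude "by <entity>"/subject heuristic of the kind described for the detection models. The real prototype relies on an open-source classifier, a commercial entity recognizer and a larger library of syntactic patterns; the keywords, pattern and example sentence here are invented placeholders.

import re

EVENT_KEYWORDS = {                       # hypothetical lexical cues per event type
    "ProductIntroduction": ["introduce", "launch", "unveil"],
    "Acquisition": ["acquire", "merger", "buy"],
}

def detect_event(sentence):
    text = sentence.lower()
    # 1. Naive event-type scoring (a stand-in for the trained classification engine).
    scores = {etype: sum(text.count(kw) for kw in kws) for etype, kws in EVENT_KEYWORDS.items()}
    event_type = max(scores, key=scores.get)
    if scores[event_type] == 0:
        return None
    # 2. Crude semantic-role assignment: agent is the "by X" phrase or else the subject.
    match = re.search(r"\bby ([A-Z][\w&. ]+)", sentence)
    agent = match.group(1).strip() if match else sentence.split()[0]
    return {"type": event_type, "agent": agent, "text": sentence}

print(detect_event("Denso introduces new hybrid vehicle components."))
# -> {'type': 'ProductIntroduction', 'agent': 'Denso', 'text': '...'}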
Figure 5. A portion of the executive dashboard produced by the Business Event Advisor.

This application also allows the user to examine the details for any event such as
the raw signals from which the event was detected and the implications that are inferred
from it. Figure 6 details this feature for a product introduction event that the application
has detected. This event was detected from a story about Denso introducing new hybrid
vehicle components and suggests to corporate executives the possible threats (e.g. […]) and opportunities (e.g. […]) that might impact their company.


Figure 6. A detailed view of an event detected by the Business Event Advisor.

The Business Event Advisor is a working system, but not a complete or robust one
that is ready for real users. Its main purpose was to demonstrate the value of the
corporate radar vision and our technology platform. The ambitious scope of the
application conveyed that vision but was too broad for a small research team to build
out at scale. We next set out to build a second corporate radar, with a more focused
scope, that could be built out in full and provide value for real users.

2.2. Technology Investment Radar

Decision makers often recognize – early on – the potential for a technology to have an
important impact on their business, but have difficulty determining when this potential
will be realized. For example, many executives in the mobile phone industry recognize
WiMax as a technology that may have a significant impact on their industry, but they
are less certain about whether (and especially when) that impact will be realized. Some
technologies that look promising in the lab never make it to market; others that go to
market become niche products which never deliver on the impact they promised
originally; and finally those that do deliver on their promise may do so on a different
time-line than one might have anticipated when the technology first began to emerge.
In order to manage their company effectively, executives need to continuously track
technologies to determine when various levels of investment are worthwhile – e.g.
when to invest in building up in-house expertise on the technology; when to start
designing and offering products based on the technology; etc.
The Technology Investment Radar is designed to address these needs by helping
decision-makers track the maturation of technologies that relate to their business and
understand when these technologies are mature enough to justify investing in them.
The Technology Investment Radar achieves this capability by using a model of the
business dynamics that encodes the events and entities that impact the ecosystem
surrounding a company and how they affect the maturation of technologies that are
relevant to the company. This information about technology maturation, which we call
the technology lifecycle, encodes the following stages that a technology can advance
through as it matures.


Associated with each stage are a set of gates (i.e. conditions) that must be met in
order for a technology to enter into that stage. These gates encode how entities (e.g.
manufacturers, suppliers, etc.) and events (e.g. sales, deployments, etc.) that impact the
target ecosystem determine which maturation stage a technology belongs in. For
example, some user’s company may require that there be at least five sales (or
deployments) of the technology in order for that technology to be considered as being
in the emerging stage. This is an example of a gate.
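The gate logic itself is simple to sketch: each gate is a condition over the events detected so far, and a technology is placed in the most advanced stage whose gates are all satisfied. The stage names, thresholds and event counts below are illustrative only; they are not the gates elicited from the analysts in the pilot described later.

# Hypothetical gates: stage -> list of (event type, minimum number of detected events).
STAGE_GATES = {
    "emerging": [("sale", 5), ("deployment", 5)],
    "growth":   [("sale", 50), ("standard_ratified", 1)],
}
STAGE_ORDER = ["emerging", "growth"]

def assess_maturity(event_counts):
    """Return the most advanced stage whose gates are all satisfied (or None)."""
    reached = None
    for stage in STAGE_ORDER:
        gates_met = all(event_counts.get(etype, 0) >= minimum
                        for etype, minimum in STAGE_GATES[stage])
        if not gates_met:
            break
        reached = stage
    return reached

# e.g. counts of events detected on the Web for one technology of interest
print(assess_maturity({"sale": 7, "deployment": 6}))   # -> 'emerging'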
Like the Business Event Advisor, the Technology Investment Radar uses its
business-dynamics model to continuously scan a variety of sources – e.g. RSS feeds,
blogs, public forums, standards sites, etc. – to produce a dashboard. Its dashboard is
detailed in Figure 7 which shows the maturation stage that each technology has
advanced to.
Figure 7. The executive dashboard produced by the Technology Investment Radar.

The Technology Investment Radar uses a collection of specialized web sensors to


detect events of interest. Some sensors target a specific event – e.g. deployments –
while others target a class of related events – e.g. purchase events and its
specializations like mergers and acquisitions. These different sensors are implemented
using various natural language processing algorithms like [7,8,11,13,16,18].
This application also allows the user to examine the details for any stage such as
the gates for a stage and how close these gates are to being satisfied. Figure 8 shows
the gates for the emerging market stage and which of these gates are satisfied for

WiMax. Moreover, the user can further examine the details for any gate such as the
events that have been detected which support a gate.

Figure 8. A detailed view of the emerging stage in the Technology Investment Radar.

3. Evaluation

We recently evaluated the Technology Investment Radar through a pilot study


conducted with Accenture’s Wireless Community of Practice (CoP) – an organization
within Accenture that focuses on wireless technology consulting. We give an overview
of the results from this study.

3.1. Pilot Setup

We customized the Technology Investment Radar for the Wireless CoP by modeling
the gates (and the events enabling them) that must be satisfied for each stage in the
business-dynamics model for this group (see Section 2.2 for a description of these
stages). We acquired these gates by interviewing five analysts from the Wireless CoP
for the metrics and criteria they use in assessing technology maturity. The job
description of these analysts is to monitor developments for various wireless
technologies in order to inform internal investment decisions, provide strategy and
technology consulting to external clients, and so forth.
Once the system was customized, we used it to track seven different wireless
technologies: WiMax, Mobile WiMax, WiFi, HSDPA, HSUPA, EVDO, and Mobile
TV.
We then enlisted 11 additional analysts from the Wireless CoP to evaluate the
Technology Investment Radar. Our system was the only one evaluated in this pilot
study because no competing system exists.


3.2. Accuracy

We evaluate the accuracy of the maturity assessments given by the Technology


Investment Radar. We measure the accuracy of the system as the fraction of
assessments given by the human analysts that agreed with those given by the system.
To obtain this metric, each analyst was instructed to provide maturity assessments
for the seven technologies being tracked. These assessments are based on the analyst’s
knowledge and understanding of the technologies. The analyst assessments were then
compared against those given by the Technology Investment Radar. We say that an
analyst and the Technology Investment Radar agree on the maturity of a technology if
they placed the technology in the same maturation stage. Table 1 shows the result of
this evaluation.

Table 1. The agreement between the Technology Investment Radar and human analysts on the maturity of
seven wireless technologies. The first column lists the technologies tracked. The second column lists the
agreement between our system and the human analysts given as percentages. The last column lists the
number of assessments made by the human analysts for each technology.

Technology Agreement # Assessments


WiMax 81.82% 11
Mobile WiMax 81.82% 11
HSDPA 9.09% 11
HSUPA 45.45% 11
EVDO 81.82% 11
WiFi 18.18% 11
Mobile TV 63.64% 11
Overall 54.55% 77

The overall agreement between the Technology Investment Radar and the human
analysts (and hence the accuracy of the maturity assessments produced by the system)
is 54.55%. Compared with chance agreement (which is 1/6) this difference is
statistically significant (p < 0.01 for the χ2 test), but several technologies – e.g. WiFi
and HSDPA – had low agreement which led us to examine the cause. We found that a
recurring cause was the analysts’ disagreement with the business-dynamics model.
For example, several analysts disagreed with a gate called Vendor Consolidation in the
business-dynamics model because the events enabling this gate – i.e. merger and
acquisition events – were too restrictive. Other events like corporate alliances can also
enable this gate. Hence, we revised this model based on recurring disagreements from
the analysts.
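For completeness, an agreement figure of this kind can be checked against the 1/6 chance rate with a standard goodness-of-fit test; the sketch below uses the counts implied by Table 1 (42 of the 77 assessments agreeing with the system). It is an illustrative calculation and not necessarily the exact test performed in the study.

from scipy.stats import chisquare

n_assessments = 77
n_agree = 42                                     # 54.55% of 77 assessments (Table 1)
chance = 1 / 6                                   # chance agreement across maturation stages

observed = [n_agree, n_assessments - n_agree]
expected = [n_assessments * chance, n_assessments * (1 - chance)]
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.1f}, p = {p_value:.3g}")   # p is far below 0.01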
We evaluated the effect of these revisions on the accuracy of the maturity assessments
given by the Technology Investment Radar. We enlisted 5 new analysts from the
Wireless CoP and had them provide maturity assessments using the same methodology
as above. Table 2 shows the result of this evaluation.
The overall agreement between the Technology Investment Radar and the new
human analysts (and hence the new accuracy of the system) is 74.28%. Compared with

the overall agreement from before (i.e. 54.55%), this difference is statistically
significant according to the χ2 test (p < 0.05).
Some technologies still had low agreement – e.g. HSDPA and HSUPA – but the
reason was due to disagreement among the human analysts. In each case, the maturity
assessment given by the Technology Investment Radar was the same assessment given
by the majority of the analysts.

Table 2. The effect that the revised model of the business dynamics had on the agreement between the
maturity assessments given by the Technology Investment Radar and human analysts for the wireless
technologies tracked.

Technology Agreement # Assessments


WiMax 80.00% 5
Mobile WiMax 60.00% 5
HSDPA 60.00% 5
HSUPA 60.00% 5
EVDO 80.00% 5
WiFi 80.00% 5
Mobile TV 100.00% 5
Overall 74.28% 35

3.3. Utility to End Users

We also assessed, qualitatively, the utility of the Technology Investment Radar from an
end user perspective (e.g. will the analysts continue to use the system after the pilot,
how satisfied are the analysts with the tool, and so forth). This assessment was done
through an exit survey administered to the analysts from the pilot study.

The survey was completely anonymous. It was hosted on a third party survey
hosting site where the identities of the respondents were not known to us. Hence, the
respondents were not under any pressure to respond favorably. The survey consisted of
25 questions, but given space limitations, we will not present responses from all
these questions. Instead, we give an overview of the highlights from the survey based
on preliminary responses from 9 of the analysts.

• When asked to indicate their overall satisfaction with the system – the possible
answer choices are very satisfied, somewhat satisfied, neutral, somewhat
dissatisfied, and very dissatisfied – 66.7% of the analysts said very satisfied,
22.2% said somewhat satisfied, and 11.1% said neutral. No analysts gave a
somewhat dissatisfied or very dissatisfied response.
• When asked to indicate if they will continue to use the system after the pilot
study – the possible answer choices are yes and no – 88.9% of the analysts said
yes and 11.1% said no.


• When asked if they would recommend the system to a colleague – the possible
answer choices are yes and no – 88.9% of the analysts said yes and 11.1% said
no.
• When asked to indicate how using the Technology Investment Radar to track
technology maturation compared to their current method – the possible answer
choices are much better, somewhat better, about the same, somewhat worse,
and much worse – 22.2% of the analysts said much better, 66.7% said
somewhat better, 0.0% said about the same, and 11.1% said somewhat worse.
No analysts gave a much worse response.

These responses show that the majority of the analysts found the Technology
Investment Radar to be useful and will continue to use the system after the pilot. These
responses also demonstrate the value of corporate radars that can automatically mine
the Web for business insights that are relevant to a decision maker's organization – in
this case insight regarding the maturity of technologies that impact the Wireless CoP.

4. Conclusion and Future Work

We have discussed an ongoing effort to develop a technology platform that can be customized to create a wide range of corporate radar applications, each of which uses a
customized semantic model to turn the Web into a systematic source of relevant
business insight. These corporate radar applications must determine what external
events are relevant to a company’s specific concerns, detect these events by mining the
Web, and interpret the implications of these events with respect to the company’s
concerns. Our platform enables the development of such radars by building on
established AI technologies – i.e. semantic models, natural language approaches, and
inference engines – and integrating them in a novel way.
We presented two prototype corporate radars built using our platform: an early
prototype called the Business Event Advisor, which detects threats and opportunities
relevant to a decision maker’s organization, and the Technology Investment Radar, a
follow-on effort, which assesses the maturity of technologies that impact a decision
maker’s business. The Technology Investment Radar was piloted with business users,
and we presented initial results from this pilot, which begin to demonstrate the value
that these kinds of systems can actually provide to real users.
Our initial experience with these prototypes – and the pilot results – has been
encouraging. However, there remain several issues that must be addressed in order to
turn our work into a solution that business analysts can readily use to build robust
radars customized for their concerns. This goal will require support that allows
business analysts, who are not trained knowledge engineers, to customize and create
the semantic models. We have begun developing a set of GUI based business
environment modeling tools that will enable business analysts to perform this task. We
are also leveraging previous research on enabling subject matter experts to author
semantic models without the aid of knowledge engineers as part of this effort [3].
Achieving a more robust solution will also require either automated or semi-automated
methods that can enhance and update the semantic models used by corporate radars.
Although using a generic upper ontology as a starting point improves reuse and reduces
the customization effort required, a sizeable amount of time is still required to extend
this upper ontology for each new radar application and to update these models once an
organization’s concerns change. We are exploring approaches that can enhance and/or
update existing models based on emerging patterns of detected events – e.g. repeating
patterns of product introduction events coming from a competitor’s supplier, followed
by product feature change events coming from that competitor, might be recognized by
a rule learning system to create a rule whereby product introductions imply feature
changes in appropriately related entities.

Acknowledgement

This chapter draws on previous papers we have written on this topic, including
“Business Event Advisor: Mining the Net for Business Insight with Semantic Models,
Lightweight NLP, and Conceptual Inferences” and “Using Lightweight NLP and
Semantic Modeling to Realize the Internet's Potential as a Corporate Radar". We would like to thank Chris Cowell-Shah, who contributed to the authoring of these papers and to the evolution of our thinking on corporate radars. We also want to thank Chris for the implementation of the Business Event Advisor.

References

[1] K. Barker, Semi-Automatic Recognition of Semantic Relationships in English Technical Texts, PhD
thesis, University of Ottawa, 1998.
[2] K. Barker, B. Porter, and P. Clark, A Library of Generic Concepts for Composing Knowledge Bases,
KCAP, 2001.
[3] K. Barker et al., A Knowledge Acquisition Tool for Course of Action Analysis, IAAI, 2003.
[4] T. Berners-Lee, J. Hendler, and O. Lassila, The Semantic Web: A New Form of Web Content that is
Meaningful to Computers Will Unleash a Revolution of Possibilities, Scientific American, 2001.
[5] P. Clark and B. Porter, KM: The Knowledge Machine, Technical Report,
http://www.cs.utexas.edu/users/mfkb/RKF/km.html.
[6] C. Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, 1998.
[7] D. Gildea and D. Jurafsky, Automatic Labeling of Semantic Roles, Computational Linguistics 28(3),
2002.
[8] K. Hacioglu, Semantic Role Labeling Using Dependency Trees, COLING , 2004.
[9] A. Kass and C. Cowell-Shah, Business Event Advisor: Mining the Net for Business Insight with
Semantic Models, Lightweight NLP, and Conceptual Inference, KDD Workshop on Data Mining for
Business Applications, 2006.
[10] O. Lassila and R. Swick, Resource Description Framework (RDF) Model and Syntax Specification,
Technical Report, W3C, 1999.
[11] M. Lesk, Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine
Cone from an Ice Cream Cone, 5th International Conference on Systems Documentation, 1986.
[12] B. McBride, Jena: Implementing the RDF Model and Syntax Specification, Semantic Web Workshop,
2001.
[13] R. Mihalcea and D. Moldovan, An Iterative Approach to Word Sense Disambiguation, FLAIRS, 2000.
[14] E. Sirin, B. Parsia, B. Grau, A. Kalyanpur and Y. Katz, Pellet: A Practical OWL-DL Reasoner,
Technical Report UMIACS, 2005.
[15] J. Sowa, Conceptual Structures: Information Processing in Mind and Machine, Addison-Wesley, 1984.
[16] R. Swier and S. Stevenson, Exploiting a Verb Lexicon in Automatic Semantic Role Labeling,
HLT/EMNLP, 2005.
[17] F. Vogelstein, Gates vs. Google: Search and Destroy, Fortune 151(9), 2005.
[18] P. Yeh, B. Porter, and K. Barker, A Unified Knowledge Based Approach for Sense Disambiguation and
Semantic Role Labeling, AAAI, 2006.


Spatial Data Mining in Practice: Principles and Case Studies

Christine KÖRNER a,1, Dirk HECKER a, Maike KRAUSE-TRAUDES a,
Michael MAY a, Simon SCHEIDER a, Daniel SCHULZ a, Hendrik STANGE a,
Stefan WROBEL a,b
a Fraunhofer IAIS, Sankt Augustin, Germany
b Department of Computer Science III, University of Bonn, Germany

Abstract. Almost any data can be referenced in geographic space. Such data permit advanced analyses that utilize the position and relationships
of objects in space as well as geographic background information. Even
though spatial data mining is still a young research discipline, in the
past years research advances have shown that the particular challenges
of spatial data can be mastered and that the technology is ready for
practical application when spatial aspects are treated as an integrated
part of data mining and model building. In this chapter in particular,
we give a detailed description of several customer projects that we have
carried out and which all involve customized data mining solutions for
business relevant tasks. The applications range from customer segmen-
tation to the prediction of traffic frequencies and the analysis of GPS
trajectories. They have been selected to demonstrate key challenges, to
provide advanced solutions and to arouse further research questions.

Keywords. spatial data mining, algorithms, case studies



Introduction

Over the past years the interest in spatial data has clearly been pushed by the
wide availability of recording technologies such as the Global Positioning Sys-
tem (GPS), mobile phone data or radio frequency identification (RFID). Today,
nearly all database systems support data types for the storage and processing
of geographic data. However, knowledge discovery from geographic data is still
a young research direction. In classic data mining many algorithms extend over
multi-dimensional feature space and are thus inherently spatial. Yet, they are not
necessarily adequate to model geographic space.
Spatial data mining combines statistics, machine learning, databases and vi-
sualization with geographic data. The task is to identify spatial patterns or ob-
jects that are potential generators of such patterns. This includes also the identification of information which is relevant to explain the spatial patterns and to present the results in an intuitive way that supports further analysis of the data.

1 Corresponding Author: Christine Körner, Fraunhofer IAIS, Schloss Birlinghoven, 53754 Sankt Augustin, Germany; E-mail: [email protected].
Georeferenced data differ in a number of ways from traditional tabular data
and therefore challenge the application of data mining methods to spatial prob-
lems [1]. First, autocorrelation is typical for data within a geographic context, yet
it is unusual in traditional data mining. Second, spatial data types range from
simple point data to complex objects and can further be combined to network
structures and spatio-temporal data structures. These structures must be handled
by the data mining algorithms and require sophisticated feature extraction meth-
ods. Furthermore, feature extraction is known to be the most time-consuming
step during data mining, which is even more true for operations on spatial ob-
jects. A third challenge therefore is the development of specialized algorithms that
interweave the feature extraction and the data mining step. Naturally, on-the-fly
feature extraction in combination with early pruning of the search space can lead
to substantial performance improvements. The following three paragraphs briefly
introduce the nature of spatial data with respect to the presented challenges.
Spatial phenomena are characterized by autocorrelation. Tobler [2] formu-
lates this basic principle as follows: “[...] everything is related to everything else,
but near things are more related than distant things”. Autocorrelation is a pow-
erful resource to improve inference, however it can cause poor performance for
algorithms that ignore it [3]. Traditional data mining methods assume that data
are independent and identically distributed (iid), thus they are not prepared to
model autocorrelation. Spatial autocorrelation can be measured, for example, by
Geary’s c or Moran’s I statistics [4].
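
As an illustration of such a measure, the following sketch computes global Moran's I for a small value vector and a hypothetical binary contiguity matrix (both are illustrative inputs, not project data):

    import numpy as np

    def morans_i(values, weights):
        """Global Moran's I for observations `values` and an n x n spatial weights matrix."""
        x = np.asarray(values, dtype=float)
        w = np.asarray(weights, dtype=float)
        z = x - x.mean()                        # deviations from the global mean
        numerator = np.sum(w * np.outer(z, z))  # sum_ij w_ij * z_i * z_j
        denominator = np.sum(z ** 2)
        return (len(x) / w.sum()) * (numerator / denominator)

    # Hypothetical example: four areas along a line; adjacent areas are neighbors.
    values = [1.0, 2.0, 8.0, 9.0]
    contiguity = np.array([[0, 1, 0, 0],
                           [1, 0, 1, 0],
                           [0, 1, 0, 1],
                           [0, 0, 1, 0]])
    print(morans_i(values, contiguity))  # 0.4, i.e. positive spatial autocorrelation
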
References to geographic space can be modeled in several ways, resulting in
different spatial data types. In general, continuous phenomena that spread in
space, e.g. temperature or humidity, and geographically referenced objects, e.g.
houses or rivers, are distinguished. The former type is modeled as field data and
has been explored extensively in the area of geostatistics. One of the most well-
known regression techniques for field data is probably Kriging [5]. The latter type
is modeled using vectors in the form of points, lines or polygons. Depending on the
importance of their true spatial extent, objects may be generalized to point data
by their centroid. Vector data can be combined to form networks or tessellations
in space. The addition of temporal information, as found in trajectories or geo-
graphically referenced time series, introduces even more complexity to the data.
The relationships between spatial objects offer a rich source of information for
data mining. Therefore, a number of spatial feature extraction and aggregation
methods have been developed. One basic relational feature is the distance between
two objects, which is usually measured using the Euclidean distance. For lines
and polygons the shortest distance between any two points of the objects or
their minimum bounding rectangles can be used. However, in practice distance is
often calculated between centroids. More complex spatial features can be derived
from the topological relationship between two objects as described by the 9-
intersection model [6] or by the connectivity of a network. Another useful method
for feature extraction is aggregation, which summarizes information in a particular
neighborhood of an object. The neighborhood is commonly defined by buffers,
drive-time zones or Voronoi polygons.
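
As a small sketch of such relational and aggregate features (using the shapely geometry library; the coordinates, the 500 m buffer radius and the POI list are illustrative assumptions), an object-to-object distance and a buffer-based POI count can be derived as follows:

    from shapely.geometry import Point, LineString

    # Two georeferenced objects in a metric projection (coordinates in meters).
    store = Point(3500.0, 1200.0)
    river = LineString([(0.0, 0.0), (4000.0, 500.0)])

    # Basic relational feature: shortest Euclidean distance between the two objects.
    distance_to_river = store.distance(river)

    # Aggregation feature: number of POI within a 500 m buffer around the store.
    poi = [Point(3600.0, 1100.0), Point(3900.0, 1700.0), Point(100.0, 100.0)]
    neighborhood = store.buffer(500.0)
    poi_count = sum(1 for p in poi if neighborhood.contains(p))

    print(distance_to_river, poi_count)
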


The following sections present recent industry projects at Fraunhofer IAIS that demonstrate the demand for and benefit of spatial data mining. The sections
are organized to reflect the increasing complexity of spatial data types. Besides
the application scenario, each project highlights one or more aspects of the above
challenges for spatial data mining and shows a practical solution.

1. Spatial Data Mining for Marketing and Planning

This section presents three case studies in marketing and planning which utilize
vector data for their analysis. The first study forecasts sales at potential new
locations for a trading company and emphasizes the handling of large amounts of
spatial features. The second and third study apply visual analytics and subgroup
discovery for customer segmentation and the optimization of mobile networks
respectively.

1.1. Sales Forecasting for Retail Location Planning

Choosing the appropriate site is crucial for the success of every retailing company.
From a microeconomic point of view, the expected sales at a location are the most
important decision criterion for the evaluation of potential new sites. However,
sales forecasting is still a great challenge in retail location planning today. How
can sales at potential new locations be predicted? And which factors influence
sales the most?
Our project partner is one of Austria’s leading trading companies. In order
to reduce the risk in location decisions while continuing growth, the company
sought an automated sales forecasting solution to evaluate possible new sites.
In our project we identified and quantified the most important factors influenc-
ing sales at operating store locations for three different product lines and store
formats (supermarket, hypermarket and drugstore). The main challenge of the
project was to handle an abundance of attributes which possessed diverse levels
of spatial resolution and for which the most appropriate resolution was not known
beforehand. We applied support vector machines (SVM) for the regression task
as they are robust in the face of high-dimensional data. SVMs are not spatial by
themselves, therefore we conducted extensive feature extraction during which all
spatial operations were performed.
The training set for model learning was made up of about 1,400 existing
stores from all over Austria and a broad variety of socio-economic, demographic
and market data on different administration levels as well as competitor informa-
tion and points of interest (POI). Most of the socio-economic and market data
were available on hierarchical spatial aggregation levels of states, districts resp.
cities, municipalities and Zählsprengel as well as post code areas. Zählsprengel
are subunits of municipalities at the lowest spatial aggregation level for which
official statistics are available (around 1,000 inhabitants on average). They proved
to be especially valuable for modeling purposes because they reflected most of the
spatial variability.
In order to characterize the environment of individual shops, we first built
trading areas for which socio-economic, demographic, competitor and POI information was aggregated. The feature extraction process for each source of infor-
mation is described in more detail in the following paragraphs. Generally, ag-
gregation can be performed using buffers or drive time zones. They mark, for a
fixed location, the area which lies within a given range or which can be reached
within a given time respectively. However, location factors show different effects
on different levels of spatial aggregation, and it had been unknown beforehand
which levels would yield the highest impact. For instance, if attributes had been
taken into account solely based on 5-minutes drive time zones, important posi-
tive shopping linkages which mostly appear within the range of a 3-minutes walk
would have been lost. Therefore, we built several trading areas with varying spa-
tial extent based on drive time zones for cars and pedestrians (1-5 and 1-3 minutes
respectively) as well as buffers with a distance between 100 and 500 meters based
on the street network. This resulted in a total of 13 trading areas per store.
Naturally, the trading areas did not correspond to the spatial units by which
the socio-economic and demographic data were provided. Therefore, an assign-
ment of attribute values in proportion to the intersecting area of a trading cell
and other spatial units was made. Let ta denote the trading area of interest and
u ∈ U the spatial units that carry some attribute a(). The assignment is specified
by:

a(ta) = Σ_{u ∈ U} ( area(ta ∩ u) / area(u) ) ∗ a(u)

This approach assumes that socio-economic and demographic characteristics are equally distributed within a given spatial unit. However, especially rural areas
with sparse population violate this assumption and skew the assignment. There-
fore, we regarded only the proportion of built-up areas for the redistribution of
attribute values.
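
A minimal sketch of this areal weighting (assuming shapely polygons for the trading area and the statistical units; the restriction to built-up areas is omitted for brevity):

    from shapely.geometry import Polygon

    def areal_weighted_value(trading_area, units):
        """a(ta) = sum over units u of area(ta ∩ u) / area(u) * a(u)."""
        total = 0.0
        for geometry, value in units:
            overlap = trading_area.intersection(geometry)
            if not overlap.is_empty:
                total += overlap.area / geometry.area * value
        return total

    # Hypothetical example: a trading area overlapping one of two statistical units by half.
    ta = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
    units = [(Polygon([(0, 0), (20, 0), (20, 10), (0, 10)]), 1000),
             (Polygon([(-10, 0), (0, 0), (0, 10), (-10, 10)]), 600)]
    print(areal_weighted_value(ta, units))  # 0.5 * 1000 + 0.0 * 600 = 500.0
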
The stress of competition was expressed by counting the number of competi-
tor stores within the trading areas. In addition, we aggregated their membership
to a competing retail chain, shop type and size, estimated turnover as well as
opening hours. It is crucial to incorporate competitive effects in the forecasting
model because competition is always a strong determinant of the amount of own
sales. However, it is important to notice that competition must not necessarily
be negative; it can also have positive impacts by raising the cumulative attrac-
tion of a site. We further took the distances to own shops of the company as
competitive factor into account because they also draw off sales from a location
(a phenomenon which is called retail cannibalization). Last, we included the ge-
ographical coordinates of the locations into our model to account for local and
global trends.
We created location specific geographic features as, for example, population
density, centrality, accessibility or a site’s shopping linkage potential. To assess
the shopping linkage potential, we evaluated 3,000 different branches of POI with
regard to their individual interception potential for the product lines and retail
formats of our project partner. We again ascertained the number of relevant POI
because it was expected that a high number of affine POI would increase the
attraction of a site and thus lead to higher sales. The process of geographical

feature construction was supported by extensive visual analysis. The comparison of the spatial distribution of sales and touristic features (for instance the number
of touristic POI as ski-lifts, hotels or holiday homes or the number of employees in
the tourism sector) led to the discovery of a significant positive spatial correlation
between sales and tourism.
At the end of the data preparation step, about 200 attributes for each trading
area of each store were determined. Some of these attributes had been considered important a priori. However, our analysis also revealed complex geographical
features which had not been considered to be important sales determinants in
certain parts of Austria before.
Taking the annual sales of each store of the year 2005 as response variable, we
finally selected the most important features from the vast amount of independent
variables and quantified their impact on sales. For this purpose we applied a
wrapper approach with forward feature selection and calculated the regression
coefficients for the selected attributes via support vector regression [7]. Due to
the fact that the amount of sales for different merchandising lines and retail
formats depends on different location factors, we developed specific models for new
locations of supermarkets, hypermarkets and drugstores. The resulting regression
models reduced the vast number of input variables to a set of 25 to 60 relevant
features.
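
A condensed sketch of such a wrapper (assuming a numeric feature matrix X and annual sales y as numpy arrays, and scikit-learn's support vector regression; attribute weighting, parameter tuning and the domain-specific stopping criteria of the project are left out):

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import cross_val_score

    def forward_selection(X, y, max_features=25, cv=5):
        """Greedy forward feature selection wrapped around support vector regression."""
        remaining = list(range(X.shape[1]))
        selected, best_score = [], -np.inf
        while remaining and len(selected) < max_features:
            # Score every remaining candidate when added to the current feature set.
            scored = [(cross_val_score(SVR(kernel="rbf"), X[:, selected + [f]], y, cv=cv).mean(), f)
                      for f in remaining]
            score, feature = max(scored)
            if score <= best_score:      # stop as soon as no candidate improves the model
                break
            best_score = score
            selected.append(feature)
            remaining.remove(feature)
        return selected, best_score
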

1.2. Customer Segmentation for Marketing Services

One of the leading German gas suppliers in the B-to-B market provides database
marketing services for power authorities and local energy providers. A main chal-
lenge in this domain is to provide reliable knowledge about gas customers: What
are the main factors which influence customer interest in natural gas? How can
potential customers be reliably classified according to these characteristics? And
how can this knowledge be used to automatically support the selection of ad-
dresses for direct marketing purposes?
Spatial data mining and knowledge discovery are considered to be a promis-
ing way to deal with the above challenges, as the application involves the devel-
opment of models with geographically constrained validity, models using indirect
and contingent relations on geographical objects as well as efficient methods for
discovering this knowledge. The goals within our project were to a) find reliable
candidate features for customer description and b) classify addresses according
to the probability of customer interest in a sales representative visit.
The empirical basis of the study was a combined database of nationwide
address data with description of buildings, a database of discrete geographical
objects as rivers and elevation fields from a topographical map and a georeferenced
sample of response data from about 500,000 nationwide interviews (see left plot
in Figure 1).
In the data preparatory step the regional sample of response data was enriched
with building data and geographic context. Thereby, the relation between the
regional sample and the building data was realized by georeferencing the given
addresses. The enriched sample served as training set for the analysis of interesting
and statistically extraordinary subgroups and for the construction of a model for
rule-based classification of addresses with high response probability.


Figure 1. left: data basis with georeferenced addresses and geographical objects; right: visual
exploratory analysis of customer response ratio

In a first step we explored the data using techniques from visual analytics.
Subsequently, the resulting hypotheses were tested for statistical significance using
binomial tests and subgroup mining. Visual analytics is the “science of analytical
reasoning facilitated by interactive visual interfaces” [8]. Especially in geographic
context the visualization of information plays an important role to profit from
background knowledge, flexible thinking and imagination of human analysts [9].
Subgroup discovery detects groups of objects with common characteristics
that show a significant deviation in their target value with respect to the whole
data set. In our application we searched for subgroups with a significantly larger
response probability to marketing campaigns than in general. The quality of a
subgroup depends on a quantitative and a qualitative term, which measure the
size of a subgroup and the pureness of the target attribute within the subgroup
respectively. More precisely, the quality q of a subgroup h is defined as

q(h) = |p − p0| / √( p0 (1 − p0) ) ∗ √n
and accounts for the difference of target share between the subgroup p and the
whole data set p0 , as well as the size n of the subgroup [10]. Spatial subgroups
are formed if the subgroup definition involves operations on spatial components
of the objects. However, spatial operations are expensive. They lead to a loss
of performance during execution or require additional storage when computed
in advance. Klösgen and May [11] developed a spatial subgroup mining system,
which integrates spatial feature extraction into the mining process. They exploit
the fact that it may not be necessary to compute all spatial relations due to early
pruning in the mining process. The spatial joins are performed separately on each
search level, which reduces the number of spatial operations and avoids redundant
storage of features.
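
A minimal sketch of this quality function (assuming boolean arrays marking subgroup membership and a binary response target; the spatial joins that define membership are not shown):

    import numpy as np

    def subgroup_quality(in_subgroup, target):
        """q(h) = |p - p0| / sqrt(p0 * (1 - p0)) * sqrt(n) for a binary target."""
        in_subgroup = np.asarray(in_subgroup, dtype=bool)
        target = np.asarray(target, dtype=float)
        p0 = target.mean()                 # target share in the whole data set
        n = in_subgroup.sum()              # subgroup size
        if n == 0 or p0 in (0.0, 1.0):
            return 0.0
        p = target[in_subgroup].mean()     # target share within the subgroup
        return abs(p - p0) / np.sqrt(p0 * (1 - p0)) * np.sqrt(n)

    # Hypothetical example: responders among addresses close to a river vs. the whole sample.
    near_river = [True, True, True, False, False, False, False, False]
    responded = [1, 1, 0, 0, 0, 1, 0, 0]
    print(subgroup_quality(near_river, responded))
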
One major result of our study was that geographic relations, such as river
distance and ground elevation, as well as the age of buildings can be used to
improve the response probability of a sample of addresses. One example for an
interesting subgroup of customers was people using heating oil instead of gas
and living within 1 km distance from a larger river, which could be explained by

the specific flooding risk for oil tanks. Figure 1 (right) shows an example of this
pattern as experienced during visual exploratory analysis.

1.3. Mobile Network Planning

The quality and coverage of a mobile network are key factors for the success of
a mobile telecommunication company. In order to support decisions about the
extension and optimization of such a network, we analyzed the capacity, quality
and cost-effectiveness of the mobile network of one of the leading German mobile
telecommunication companies. The goal of the project was to identify rural areas
with a high demand for mobile network services and to relate the demand to
demographic and geographic characteristics.
Mobile networks extend over geographic space and therefore make a strong
claim on the inclusion of geographic data in the analysis process. A first explo-
rative data analysis showed, for example, a decreasing network quality within
cells in increasingly hilly areas. The overall input data of the project consisted
of network usage, demographic and geographic information. In the data prepara-
tory step we merged all three kinds of data and aggregated attribute values such
as population and POI for radio network cells. In addition, we defined a target
attribute which describes the demand of (future) network services:

cell potential = ( number of calls / number of customers ) ∗ population

It weights the population of an area with the average number of calls of the
present customers. Similar to the above project about customer segmentation,
we applied subgroup discovery to detect variables that influence the demand for
mobile network services. We used the SPIN! [12] spatial data mining platform,
which has been developed within the EU project IST-1999-10536 SPIN!. It joins
the power of data mining and geographic information systems by the implemen-
tation of spatial data mining algorithms and a dynamic visualization of spatial
patterns. This allows for a mutual interaction between the system and the user,
between automatically generated hypotheses and user defined hypotheses.
In our project we first analyzed cell potential according to network usage as,
for example, call duration and transmitted data volume. A visualization of the
best 30 subgroups suggested a spatial pattern along interstate highways. Figure
2 (left) shows results for the area of Stuttgart. Dark and light colored cells are
not randomly distributed in space but form chain-like structures, dark colors in-
dicating that cells participate in a high number of subgroup models. The map
was then supplemented with various layers of geographic information, and a co-
incidence with interstate highways (dark lines) became visible. In the next step
we added spatial information, including road network and public transportation
data, to the analysis process. The resulting subgroups confirmed that cells along
interstate highways and railways have an increased demand for mobile network
services and should receive special attention during mobile network planning (see
right plot in Figure 2).


Figure 2. left: subgroup patterns based on network usage in the area of Stuttgart; right: subgroup
patterns based on geographic information in the area of Stuttgart

2. Prediction of Traffic Frequencies and Detection of Customer Movement

In this section we present two case studies which involve network data and ge-
ographically referenced time series. The first study develops a traffic frequency
map and emphasizes the tight integration of feature extraction and data mining
algorithms for performance optimization. The second study extracts customer
movements from tabloid sales data.

2.1. Frequency Map for German Cities

The German outdoor advertisement industry realizes a yearly turnover of about 780 million Euro. Its umbrella organization, which represents a joint market share
of over 80 percent, provides performance indicators for poster sites on which the
pricing of advertisement campaigns is based. The indicator consists of a quantita-
tive and a qualitative measure. The quantitative term states the number of pass-
ing vehicles, pedestrians and public transport while the qualitative term specifies
the average notice of passers-by. As part of an industrial project we developed
a frequency map for German cities which today forms an essential part of price
calculations in the German outdoor advertisement.
Essential for the prediction of traffic frequencies are the exploitation of geo-
graphic neighborhood, inclusion of background knowledge and performance op-
timization. We therefore applied a modified k-nearest neighbor (kNN) algorithm
[13]. Nearest neighbor algorithms are generally able to incorporate spatial and
non-spatial information based on the definition of appropriate distance functions.
Thus, they are inherently spatial and exploit autocorrelation as a matter of prin-
ciple. In order to gain background knowledge about the vicinity of a street, several
geographically referenced attributes were aggregated. Furthermore, the large do-
main required a tight integration of spatial feature extraction and the algorithm
in order to reduce expensive spatial operations.
The input data comprised several sources of different quality and resolution.
The primary objects of interest were street segments, which generally denote
a part of street between two intersections. Each segment possessed a geometry

object and had attached information about the type of street, direction, speed
class etc. Germany contains in total 6.2 million street segments, for which about
100,000 traffic measurements were available. In addition, demographic and socio-
economic data about the vicinity as well as nearby POI were known. Demographic
and socio-economic data usually exist in aggregated form, for example, for official
districts like post code areas. This information was likewise assigned to all street
segments in an area. In contrast, POI are point data that mark attractive places
like railway stations or restaurants. Clearly, areas with a high POI density are
more frequented than areas with a low density. In order to obtain density infor-
mation the POI data were aggregated. Two basic aggregation methods are buffers
and drive time zones. As explained earlier, they mark, for a fixed location, the
area which lies within a given range or which can be reached within a given time
respectively. Drive time zones emphasize network constraints related to topology
and allowed speed. Imagine, for example, two locations on opposite sides of a
river. Their spatial distance is small, but the travel time between them depends
on the location of the next bridge. For our application we created buffers around
each street segment and calculated the number of relevant POI.
The central part of our traffic frequency prediction is a modified kNN al-
gorithm, which models geographic space as a subcomponent of the general at-
tribute space. The distance between two segments xa and xb is defined as the
(normalized) sum of absolute distances of their attributes
d(xa, xb) = Σ_{i=1..m} |xai − xbi|

For fine tuning, the attributes were assigned domain dependent weights, which
we will not discuss here further. The frequency y0 of a street segment is calculated
as the normalized weighted sum of frequencies from the k nearest neighbors, each
weight inversely proportional to the distance between the two segments

y0 = ( Σ_{i=1..k} wi yi ) / ( Σ_{i=1..k} wi ),   with   wi = 1 / d(x0, xi)
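
A minimal sketch of this estimate (assuming the combined attribute distances to the k nearest measured segments have already been computed and are strictly positive):

    def knn_frequency(neighbor_frequencies, neighbor_distances):
        """Distance-weighted frequency estimate y0 with weights wi = 1 / d(x0, xi)."""
        weights = [1.0 / d for d in neighbor_distances]
        weighted_sum = sum(w * y for w, y in zip(weights, neighbor_frequencies))
        return weighted_sum / sum(weights)

    # Hypothetical example: three measured segments with traffic counts 1200, 800 and 500.
    print(knn_frequency([1200, 800, 500], [1.0, 2.0, 4.0]))
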

The kNN algorithm is known to use extensive resources as the distances be-
tween each street segment and all available measurements have to be calculated.
For a city like Frankfurt this amounts to 43 million calculations (about 21,500
segments and 2,000 measurements). While differences in numerical attributes can
be determined very fast, the geographic distance between line segments is compu-
tationally expensive. We therefore implemented the algorithm to perform a dy-
namic and selective calculation of distance from each street segment to the various
measurement locations. First, at any time distances to only the top k neighbors
are stored, replacing them dynamically during the iteration over measurement
sites. Second, a step-wise calculation of distance is applied. If the summarized
distance of all non-spatial attributes already exceeds the maximal total distance
of the current k neighbors, the candidate neighbor can be safely discarded and no
spatial calculation is necessary. Else, the distance between the minimum bound-
ing rectangles (MBRs) of the line segments is calculated. The MBR distance is a

lower bound for the actual distance between the line segments and less expensive
to calculate. Again, if the distance of the non-spatial attributes plus the distance
between the MBRs is greater or equal to the threshold, the instance can be dis-
carded. Only if both tests are passed, the actual spatial distance is determined.
For the city of Frankfurt, this integrated approach sped up calculations from
nearly one day to about two hours. In addition, the dynamic calculations reduced
the required disc space substantially.
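
The step-wise calculation can be sketched as follows (assuming shapely line geometries for the segments and a precomputed, already weighted non-spatial attribute distance per candidate; the exact attribute weighting of the project is omitted). The distance between the minimum bounding rectangles serves as a cheap lower bound that is checked before the exact geometric distance is computed:

    import heapq
    from shapely.geometry import box

    def mbr_distance(geom_a, geom_b):
        """Distance between minimum bounding rectangles, a lower bound on the exact distance."""
        return box(*geom_a.bounds).distance(box(*geom_b.bounds))

    def k_nearest_pruned(segment, candidates, nonspatial_distance, k=5):
        """k nearest measurement segments with step-wise distance calculation.
        candidates: list of (id, geometry); nonspatial_distance: id -> attribute distance."""
        heap = []  # max-heap (negated distances) holding the current k best candidates
        for cand_id, cand_geom in candidates:
            d_attr = nonspatial_distance[cand_id]
            threshold = -heap[0][0] if len(heap) == k else float("inf")
            if d_attr >= threshold:
                continue  # prune: non-spatial attributes alone already exceed the kth best
            if d_attr + mbr_distance(segment, cand_geom) >= threshold:
                continue  # prune: MBR lower bound already exceeds the kth best
            d_total = d_attr + segment.distance(cand_geom)  # exact spatial distance only now
            if len(heap) < k:
                heapq.heappush(heap, (-d_total, cand_id))
            elif d_total < -heap[0][0]:
                heapq.heapreplace(heap, (-d_total, cand_id))
        return sorted((-d, cand_id) for d, cand_id in heap)
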

2.2. Customer Movement Detection in Sales Data

In recent years, companies have spent great effort to systematically profile their
customers in order to gain insights into target group characteristics and shopping
behavior. Newspaper companies are especially interested in purchasing behavior
as they face the challenge to supply each point of sale (POS) with the number
of copies that are expected to be sold the next day. Existing forecasting systems
employ complex time series models to predict the expected sales. However, they
are bound to the temporal dimension and lack the understanding of local market
situations and the customers’ movement behavior at a particular selling point.
Clearly, closures and sellouts influence shopping behavior and lead to one key
business question: Where do readers buy a specific newspaper if their preferred
shop is closed or has no more copies left? In an industry project with the publisher
of Europe’s leading tabloid newspaper, we developed a spatial model to detect
and visualize local customer behavior.
The data basis for our model were approximately 110,000 POS, irregularly
distributed over Germany. For each object a triannual history of sales figures
was available. All objects were equipped with location information and could be
mapped to a network of street segments. Information about the street network
restrains vehicular as well as pedestrian movement and therefore simplifies the ge-
ographic space of possible movement. In addition, socio-demographic data about
the vicinity of a POS as well as nearby POI were known. Both are needed to bet-
ter understand, explain and learn the movement behavior of local target groups.
For example, certain patterns or habits might correlate with certain demographic
attributes or POI.
Shopping behavior is influenced by intrinsic as well as extrinsic factors [14,15].
This includes the individual destination, spatial barriers, mood (activation) and
available selling points. In our model we assume that readers follow some routine.
For example, the reader may buy the newspaper at his/her preferred selling point
along his/her way to work. Such a routine can easily be interrupted by external
factors as, for example, sellouts, vacation or openings of new shops requiring the
customer to adapt his/her behavior. The challenge of the project was to detect,
quantify and learn the behavior of customers after any such event and to predict
the amount of copies that are additionally sold in alternative shops. Clearly,
without personalized data customer movement can hardly be traced over a whole
city. We therefore restricted our analysis to the local environment of a POS.
The first task in learning local movement patterns was to define a reasonable
spatial unit for movement detection, which we call movement space (see left plot
in Figure 3). If the unit is set too large, movement patterns will be lost in general

noise or overlaid by side effects due to events at other POS. Limiting the space
too strongly, however, reduces the chance to detect reasonable movement patterns
within. We employed two criteria to define the size of the unit, namely drive
time zones and Voronoi neighbors. Drive time zones were used to set the initial
(and maximal) extent of the movement space according to typical pedestrian
walking speed. This area was further restricted based on the assumption that
people who immediately seek an alternative POS will not pass by two alternative
POS without buying. Of course, the individual choice depends on the knowledge
of each customer about the set of selling points in his/her range (awareness set).
In order to limit the movement space, we calculated the convex hull of the second
order POS Voronoi neighbor (see right plot in Figure 3). The resulting area was
the space in which we looked for additional newspaper sales as an indicator for
movement if the service at some POS had been unavailable. We call the set of all
POS inside the movement space optional shops.

Figure 3. left: movement space of a particular POS showing the convex hull of the second order
Voronoi neighbor (dark gray area) and the initial drive time zone of the POS (bold gray street
segments); right: second order Voronoi neighbors of a POS with respect to natural barriers
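
A sketch of how such a neighborhood can be derived (assuming POS coordinates in a metric projection and SciPy's Voronoi tessellation and convex hull; natural barriers and drive time limits are ignored here):

    import numpy as np
    from scipy.spatial import Voronoi, ConvexHull

    def movement_space(points, pos_index):
        """Convex hull of a POS together with its first- and second-order Voronoi neighbors."""
        vor = Voronoi(points)
        neighbors = {i: set() for i in range(len(points))}
        for a, b in vor.ridge_points:        # cells sharing a ridge are Voronoi neighbors
            neighbors[a].add(b)
            neighbors[b].add(a)
        first = neighbors[pos_index]
        second = set().union(*(neighbors[j] for j in first)) | first
        second.discard(pos_index)
        members = [pos_index] + sorted(second)
        hull = ConvexHull(points[members])
        return points[members][hull.vertices]  # corner points of the movement space

    # Hypothetical example: a handful of POS locations around the POS with index 4.
    pos = np.array([[0, 0], [1, 0], [2, 0], [0, 1], [1, 1], [2, 1], [3, 3]], dtype=float)
    print(movement_space(pos, pos_index=4))
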

The basic idea to detect local movement patterns in case of a changed local
market situation (closures, sellouts, etc.) is to predict the sales of all optional
shops assuming a typical shopping behavior and to compare the prediction with
their actual sales. All shops showing an increased sale are likely to gain customers
from the considered shop. In order to predict the expected number of copies at
some POS, we calculated the sales based on shops with similar selling trends in
the recent past. These shops are called reference shops. The reference shops were
dynamically determined by maximizing the similarity in selling trends applying a
two week window before the registered event of the original POS. In this way, also
seasonal or regional trends could be anticipated. Of course, all reference shops
have to be located outside the movement space in order to be independent of any
event-driven movement caused by the POS under consideration. If an optional
shop sells a certain amount of copies above the expected number, it is likely that
customers of the considered POS buy their newspaper alternatively at that point.
Over time we gain robust knowledge about the movement behavior of the local
customer base as well as a set of alternative shops inside the movement space.
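
A condensed sketch of this comparison (assuming daily sales as numpy arrays per shop, a known event day, and that the candidate reference shops are already restricted to shops outside the movement space; window length, number of reference shops and the excess threshold are illustrative choices):

    import numpy as np

    def expected_sales(target_history, reference_histories, event_day, window=14, n_refs=5):
        """Expected event-day sales of an optional shop, derived from reference shops
        whose selling trend in the two weeks before the event is most similar."""
        start = event_day - window
        recent = target_history[start:event_day]
        # Rank reference shops by similarity of their recent selling trend.
        ranked = sorted(reference_histories,
                        key=lambda h: np.linalg.norm(h[start:event_day] - recent))
        estimates = [h[event_day] * recent.mean() / h[start:event_day].mean()
                     for h in ranked[:n_refs]]       # rescale to the shop's own sales level
        return float(np.mean(estimates))

    def gains_customers(actual_sales, expected, threshold=1.2):
        """Flag an optional shop as a likely alternative POS if sales clearly exceed expectation."""
        return actual_sales > threshold * expected
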
With this knowledge newspaper companies can optimize the number of copies
they deliver to each POS, taking into account not only time variant information
but also the current local market situation. Moreover, the information about cus-
tomer behavior provided by movement spaces allows to optimize location planning
and to calculate the effect of opening or closing a POS in a specific area.


3. Mobility Mining in Outdoor Advertisement

Over the past five years GPS technology has steadily conquered its place in mass
market and is on the threshold of becoming an everyday companion. Besides the
application in navigation systems, enterprises have also recognized the value of
movement histories. The outdoor advertising industries of Germany and Switzer-
land commissioned nationwide GPS field studies to collect representative samples
of mobile behavior. The data are used to calculate reach and gross contacts of
poster campaigns for specific target populations.
This section describes a general approach for mobility mining in outdoor ad-
vertisement and highlights challenges of current industrial projects for Arbeitsge-
meinschaft Media-Analyse e.V. (ag.ma) in Germany and Swiss Poster Research
Plus (SPR+) in Switzerland.

3.1. Modeling of Poster Reach

The reach of a campaign states the percentage of population which has at least
one contact with any poster of the campaign within a specified period of time.
Poster reach allows to determine the optimal duration of some advertisement and
to tune the configuration of poster networks as it expresses the publicity of some
location and the spread of information within the population.
Given trajectories for a sample of the population and geographic coordinates
of poster locations, the contacts with a given poster campaign can be extracted by
spatial intersection and the reach can be determined. One challenge of calculating
poster reach lies in the incompleteness of sample trajectories. For example, many
trajectories are incomplete due to technical defects or because people forget (to
switch on) their GPS devices. In addition, people tend to drop out of the study
early, which leads to a decreasing number of participants with advancing time.
What possibilities exist to handle incomplete data? In general, missing data can
be treated in the data preparation step or within the modeling process. During
data preparation, incomplete data objects can either be removed, ignored or filled in by modeling. However, none of these possibilities are practicable in our
application. First, if incomplete data objects are removed, the size of the data
set decreases drastically because only a few test persons produce trajectories for
the whole surveying period. Second, ignoring missing data leads to an underes-
timation of poster contacts and thus to an underestimation of poster reach as
well. Finally, the reconstruction of missing trajectories is a fairly complex and
ambitious task. We therefore treat missing data explicitly in the modeling step,
applying a technique from the area of event history analysis.
Event history analysis (also survival analysis) [16] is a branch of statistics
that investigates the probability of some event or the amount of time until a
specific event occurs. It is usually applied in clinical studies and quality control
where an event denotes, for example, the occurrence of some disease or the failure
of a device. In our application an event denotes the first contact of a test person
with a poster campaign. To calculate poster reach, we apply the Kaplan-Meier
method which allows for censored data. This method adapts to differing sample
sizes by calculating conditional probabilities between two consecutive events. If

no more data of a test person are available, the person is assumed to survive until
the next event occurs and is censored afterwards. Thus, a gradual adjustment to
the actual number of people in the sample is achieved.
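
A minimal sketch of this estimate (assuming, for each test person, the day of the first contact with the campaign, or the last observed day for persons who dropped out without a contact; a survival analysis library such as lifelines provides the same estimator):

    import numpy as np

    def kaplan_meier_reach(days, contact_observed):
        """Kaplan-Meier reach curve: 1 minus the survival of the 'no contact yet' state.
        days: day of first contact, or last surveyed day for censored persons.
        contact_observed: True if a first poster contact occurred on that day."""
        days = np.asarray(days)
        contact_observed = np.asarray(contact_observed, dtype=bool)
        survival, curve = 1.0, {}
        for t in np.unique(days[contact_observed]):          # iterate over event days only
            at_risk = np.sum(days >= t)                      # persons still observed and uncontacted
            events = np.sum((days == t) & contact_observed)  # first contacts on day t
            survival *= 1.0 - events / at_risk
            curve[int(t)] = 1.0 - survival                   # estimated reach after day t
        return curve

    # Hypothetical example: six test persons, two of them censored (dropped out) without contact.
    days = [1, 2, 2, 3, 5, 6]
    contact = [True, True, False, True, False, True]
    print(kaplan_meier_reach(days, contact))
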

3.2. Integration of Heterogeneous Mobility Data

The ag.ma, a joint industry committee of German advertising vendors and cus-
tomers, commissioned a nationwide survey to collect mobility data using two
different surveying technologies. From a total of about 30,000 test persons, one
third was provided with GPS devices while the other test persons where queried
about their movements in a Computer Assisted Telephone Interview (CATI). One
task of the project was to analyze both data sets according to their content and
structure-related differences, and to combine the data sets for modeling if possible.
Both surveying techniques bear the risk of incomplete and erroneous data.
GPS devices may easily be forgotten or passed on to other family members while
telephone interviews demand a precise and complete recollection of the activities
of the previous day. We therefore compared the mobile behavior of both data sets.
The analysis showed similar movement behavior as, for example, in the average
number of trips per day or the average distance traveled.
The main structural difference of the data sets are the different surveying
periods. While all GPS test persons collected data over a period of one week,
CATI test persons were asked about their movements on the previous day of the
interview only. However, a combination of both data sets with regard to their
structure was possible due to the adaptive character of Kaplan-Meier. As Kaplan-
Meier censors missing days, the modeling process is robust against varying lengths
of surveying periods.

3.3. Extrapolation of Reach over Time and Space

In our project with SPR+ we investigate further research questions that concern
the prediction of reach when only a limited number of measurements are available.
The first task is to predict poster reach when the measurement period is shorter
than the desired interval of time. The second challenge is to predict poster reach
in a city where no measurements at all are available. In this case, the reach of a
given campaign within one city has to be inferred from the mobility of another
(similar) city.
For the extrapolation of reach beyond the surveying period, we combine two
different extrapolation techniques. The first technique utilizes the reach of one
week to fit a log-linear function and subsequently extrapolates values for longer
periods. The second technique relies on the assumption of weekly periodic mo-
bility patterns and replicates mobile behavior accordingly. Both techniques are
interwoven according to the stability of available data.
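
A sketch of the first technique (assuming daily reach values for the surveyed week and a functional form reach(t) ≈ a + b · log(t), fitted by least squares and evaluated for longer periods; the interweaving with the periodic replication is not shown):

    import numpy as np

    def extrapolate_reach(weekly_reach, horizon_days):
        """Fit reach(t) = a + b * log(t) on the surveyed days and extrapolate to horizon_days."""
        days = np.arange(1, len(weekly_reach) + 1)
        b, a = np.polyfit(np.log(days), weekly_reach, deg=1)   # slope and intercept
        t = np.arange(1, horizon_days + 1)
        return np.clip(a + b * np.log(t), 0.0, 1.0)            # reach is a share in [0, 1]

    # Hypothetical example: reach observed over one week, extrapolated to four weeks.
    week = [0.08, 0.13, 0.16, 0.18, 0.20, 0.21, 0.22]
    print(extrapolate_reach(week, horizon_days=28)[-1])
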
The extrapolation for areas without GPS measurements is a great challenge.
Neither GPS data nor other mobility information as, for example, traffic frequen-
cies are available. In addition, individual poster characteristics which affect the
intensity of a contact need to be taken into account for the calculation of reach.
The extrapolation method therefore consists of three separate steps. First, the

traffic behavior at the poster locations of interest is inferred. Second, the pas-
sages are scaled according to individual poster characteristics. Finally, the reach
of a campaign with a similar contact distribution is assigned to the campaign of
interest. In the first step, various location attributes such as the type of street,
type and number of nearby POI or the size of population define a similarity mea-
sure by which poster passages are extrapolated. In the next step, a scaling factor
which transforms passages into poster contacts is applied. The factor depends on
individual poster characteristics and is determined based on evaluations in GPS
cities. The final assignment of poster reach depends again on a similarity mea-
sure which is defined on the contact distribution of the campaign of interest. The
extrapolation method thus accounts for general traffic characteristics, yet allows
for individual features of poster campaigns. In order to validate our extrapola-
tion method, we applied the technique in a city with GPS measurements. The
comparison of modeled and extrapolated values showed a high correlation.
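The sketch below outlines these three steps in strongly simplified form. The location attributes, the nearest-neighbour similarity, the default scaling factor and the comparison of contact distributions by their means are all illustrative assumptions; the similarity measures and scaling factors actually used in the project are derived from the GPS cities as described above.

```python
def extrapolate_city_reach(target_posters, reference_posters, reference_campaigns):
    """Illustrative three-step extrapolation for a city without GPS data.

    target_posters:      dicts with location attributes of the posters of interest
    reference_posters:   dicts with the same attributes plus measured "passages"
                         from a GPS city
    reference_campaigns: (contact_distribution, reach) pairs evaluated in GPS cities
    """
    # Step 1: infer passages at the target locations from the most similar
    # reference location (a nearest-neighbour stand-in for the similarity
    # measure defined on street type, nearby POI and population).
    def similarity(p, q):
        same_street = 1.0 if p["street_type"] == q["street_type"] else 0.0
        poi = 1.0 / (1.0 + abs(p["n_poi"] - q["n_poi"]))
        pop = 1.0 / (1.0 + abs(p["population"] - q["population"]) / 1000.0)
        return same_street + poi + pop

    contacts = []
    for poster in target_posters:
        nearest = max(reference_posters, key=lambda q: similarity(poster, q))
        # Step 2: scale passages into contacts with a poster-specific factor
        # (assumed here to be precomputed per poster, e.g. from size and visibility).
        contacts.append(nearest["passages"] * poster.get("contact_factor", 0.5))

    # Step 3: assign the reach of the reference campaign whose contact
    # distribution is most similar (here crudely compared by mean contacts).
    def distribution_distance(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    best = min(reference_campaigns, key=lambda c: distribution_distance(c[0], contacts))
    return best[1]

# Hypothetical call with one target poster and two reference campaigns.
target = [{"street_type": "main", "n_poi": 12, "population": 5400, "contact_factor": 0.6}]
reference = [{"street_type": "main", "n_poi": 10, "population": 5000, "passages": 8000},
             {"street_type": "side", "n_poi": 2, "population": 900, "passages": 1200}]
campaigns = [([4000.0, 5200.0], 0.35), ([900.0, 1100.0], 0.12)]
print(extrapolate_city_reach(target, reference, campaigns))
```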

4. Summary

In this chapter we present a collection of spatial data mining applications that have been carried out at Fraunhofer IAIS over the past years. The projects demonstrate the wide applicability of spatial data mining and the various facets of spatial data types, preprocessing methods and algorithms. We begin the chapter with case studies for marketing and planning, which involve spatial feature extraction on various levels of aggregation, extensive visual analysis and subgroup discovery. We then proceed to more advanced data types in the form of street networks and geographically referenced time series. These case studies emphasize the benefit of specialized algorithms that allow for dynamic and selective computations and underline the necessity of application-dependent definitions of neighborhood relationships. Finally, we introduce a case study using spatio-temporal trajectories which calls attention to the problem of missing data.


Without question, spatial data mining is an attractive research area with high
impact on industry and many further challenges to meet.

Acknowledgment

The authors would like to thank all business partners for their close cooperation. The publication of this chapter would have been impossible without their interest and participation. Parts of this work have been inspired by research within the EU projects IST-1999-10536 SPIN! (Spatial Mining for Data of Public Interest) and IST-6FP-014915 GeoPKDD (Geographic Privacy-aware Knowledge Discovery and Delivery). Finally, the authors acknowledge and thank all members of the Department of Knowledge Discovery whose research and constant work have contributed to the success of the presented projects.



Subject Index
algorithms 164
business applications 149
business intelligence 149
case studies 164
churn 77
classification 99
competitive and market intelligence 149
customer churn 77
data cleaning 84
data mining
   ~ applications 1
   ~ process 1
   ~ stakeholders 49
   spatial ~ 164
   utility-based ~ 49
dynamics 137
eBay 137
electricity markets 99
forecasting 137
functional data analysis 137
hierarchical clustering 84, 99
human computer interaction 17
inference 149
interactivity 17
load profiles 99
medical knowledge discovery 110
NLP 149
online auctions 137
outlier detection 84
outlier ranking 84
performance support 149
PKB 110
retail banking 77
rigor vs. relevance in research 49
semantic models 149
subgroup discovery 17

Author Index
Ahola, J. 77
Blumenstock, A. 17
Breitenbach, M. 123
Brennan, T. 123
Bruckhaus, T. 66
Ceglar, A. 110
Dieterich, W. 123
Domingos, R. 35
Figueiredo, V. 99
Ghani, R. v, 1
Grudic, G. 123
Guthrie, W.E. 66
Hecker, D. 164
Hipp, J. 17
Jank, W. 137
Kass, A. 149
Kempe, S. 17
Körner, C. 164
Krause-Traudes, M. 164
Lanquillon, C. 17
May, M. 164
Morrall, R. 110
Mueller, M. 17
Mutanen, T. 77
Nousiainen, S. 77
Pechenizkiy, M. 49
Puuronen, S. 49
Roddick, J.F. 110
Rodrigues, F. 99
Scheider, S. 164
Schulz, D. 164
Shmueli, G. 137
Soares, C. v, 1, 84
Stange, H. 164
Torgo, L. 84
Vale, Z. 99
Van de Merckt, T. 35
Wirth, R. 17
Wrobel, S. 164
Yeh, P.Z. 149