Document 3
Discovering Knowledge in Data: An Introduction to Data Mining. By Daniel T. Larose, Ph.D., John Wiley &
Sons Inc., December 2004, 222 pages. Price $69.95, ISBN 0471666572
Data mining is a new area of mathematics that combines knowledge and techniques from such diverse
fields as statistics, computer science, expert systems, machine learning, and database science. Because it
is a new field, universities are only now beginning to add courses and degrees dedicated to this field of
study. In addition, because there are so few courses, there is only a limited selection of texts from which
to choose. In particular, there are few texts that are appropriate for use in an introductory course. Dr. Daniel
Larose addresses this issue with his latest text: Discovering Knowledge in Data: An Introduction to Data
Mining. This text is written for an introductory data mining course for graduate and upper-level
undergraduate students who may come from a wide variety of academic backgrounds. Prof. Larose has
considerable experience in this area, as he is the founder and current director of Central Connecticut
State University's graduate program in data mining (http://www.ccsu.edu/datamining).
The text is grouped into 11 chapters based upon the Cross Industry Standard Process for Data Mining
(CRISP-DM) approach to data mining. The first chapter gives an overview of CRISP and clearly describes
each of the steps. Following this introduction, the second and third chapters focus on the data understand-
ing steps of CRISP. The next seven chapters focus on different modeling techniques, and finally the last
chapter describes model evaluation techniques. A strength of this book's organization is the author's "white
box" approach to data mining. In this approach, the author steps the reader through many of the modeling
algorithms and gives case studies of real-world data mining applications. Furthermore, the author provides
examples that apply real-world mining tools such as SPSS Clementine, SAS Enterprise Miner, and Insightful
Miner to actual large-scale data sets. The text contains over 90 exercises that allow students to
assess their level of understanding, and there are also over 70 "hands on" analysis problems so that
students can learn by actually doing real data mining. Another strength in the organization of this book
is that each of the chapters is a "quick read" and does not require extensive time for the student to grasp
the material.
As stated above, Chapters 2 and 3 deal with preprocessing and analyzing the data, respectively. The
author does a good job with these chapters, particularly Chapter 3. This is important because in the opinion
of this reviewer, 90% of the success in building a good model comes from accurate data scrubbing and data
analysis. In Chapter 2, the author presents data scrubbing, specifically handling missing data and identifying
outliers. Although this chapter is good in general, the section on missing data should have been
expanded. For example, the text gives three methods for filling in missing data: (1) use a predefined
constant, (2) use an average value such as the mean, median, or mode, or (3) use a random variable. It does
not, however, discuss other important and widely used data imputation techniques, such as building a model
(usually a tree) to fill in the missing values. One of the better features of Chapter 2 is the author's discussion of
outliers. The author clearly and concisely explains why identifying outliers is important and describes methods for
handling them. This matters because ignoring outliers is a common mistake among people new to data
mining.
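As a rough illustration of the three fill-in strategies listed above, the following Python sketch applies each one to a small made-up column. The function names and the data are hypothetical and are not taken from the book. The random strategy here draws from the values that were actually observed, which is only one possible reading of "use a random variable".

```python
import random
import statistics

def impute_constant(values, constant=0.0):
    """Replace each missing entry (None) with a predefined constant."""
    return [constant if v is None else v for v in values]

def impute_central(values, how="mean"):
    """Replace each missing entry with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    summarize = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }[how]
    center = summarize(observed)
    return [center if v is None else v for v in values]

def impute_random(values, seed=0):
    """Replace each missing entry with a random draw from the observed values (an assumption)."""
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    return [rng.choice(observed) if v is None else v for v in values]

# Hypothetical column with two missing entries.
minutes = [10.0, None, 30.0, 30.0, None, 50.0]

filled_constant = impute_constant(minutes)
filled_mean = impute_central(minutes, "mean")
filled_random = impute_random(minutes)
```

A model-based approach, which the reviewer notes is missing from the text, would replace the single summary value with a prediction built from the other columns.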
In Chapter 3, the author describes Exploratory Data Analysis (EDA) techniques. This chapter is particularly
well done, and even seasoned veterans of data mining could benefit from reading it. Here,
the author uses a data set from a cell phone company to predict customer attrition (also known as
"churn"). Throughout the chapter, the author methodically goes through all of the variables and examines
each of them for correlation and predictive power using a range of techniques. By the end of the chapter, the
author has succeeded in identifying variables that are predictive of churn and variables that are highly
correlated with one another. From these results, the author is able to prune a large data set down to a small
number of predictive variables from which a model can be built. The chapter is basically a "cookie cutter"
approach to EDA, and a student can apply the skills learned in this chapter to nearly any other data set.
1308 Book reviews / Information Processing and Management 41 (2005) 1299–1309
This is an important topic and the author did a good job on it. The only criticism of this section would be
that the graphs should have been presented in color. Doing so would have made this chapter even more
instructive.
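One part of this kind of analysis, flagging predictors that are highly correlated with one another so that redundant ones can be pruned, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variable names, the numbers, and the 0.9 threshold are all hypothetical, and the book's own EDA uses a wider range of techniques than a single correlation check.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def redundant_pairs(columns, threshold=0.9):
    """Return pairs of column names whose absolute correlation exceeds the threshold."""
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        if abs(pearson(columns[a], columns[b])) > threshold:
            pairs.append((a, b))
    return pairs

# Hypothetical predictors: the charge is a fixed multiple of the minutes,
# so one of the two carries no extra information.
columns = {
    "day_minutes": [100, 200, 300, 400],
    "day_charge": [17, 34, 51, 68],
    "night_calls": [3, 1, 4, 2],
}

flagged = redundant_pairs(columns)
```

In this made-up example only the minutes and charge pair is flagged, so one of those two columns could be dropped before modeling.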
Chapters 4 through 10 deal with building models once the data has been scrubbed and analyzed. The
author devotes one chapter each to the most important modeling approaches, including statistics, decision
trees, clustering, and association rules. In Chapter 4, the author starts off with the techniques that are probably
most familiar to students: statistical models. The author gives a brief but thorough discussion of the
different statistical techniques used in data mining, including univariate and multivariate regression. This
chapter is not meant to be comprehensive because the material contained in it could easily fill several college
courses. However, the student reading this chapter will have a good understanding of the basics and will be
able to build models using any of the aforementioned data mining tools. A small criticism of this particular
chapter is that the author neglects to discuss logistic regression. This is an important topic, particularly
when dealing with Boolean target variables such as the previously mentioned "cell phone company churn
data". Possibly the author is leaving this for the next volume in the series, but it could have been covered in
this text since it is not too difficult to understand.
In the next three chapters, the author discusses other modeling techniques, the most important of which
are decision trees and neural networks. The chapter on decision trees is well done. It covers both the
CART and the C4.5 algorithms, and describes their differences and similarities. The chapter is short and
to the point and makes good use of an example to explain the concept of trees. Students should have no
difficulty understanding the material and should be able to develop their own decision tree models and
explain the results after reading this chapter. The chapter on neural networks is particularly impressive.
This is a difficult concept, but it is explained in simple terms and is not unnecessarily technical. Most
importantly, to truly understand neural networks, one must understand what is happening within the
"black box" of the net. To explain this, Dr. Larose actually goes through a single step of adjusting the
internal weights of a neural network during the training phase. A student can follow along, see how the internal
weights are trained, and then understand how a neural network can "learn" when this process is applied
hundreds of thousands of times.
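To give a feel for what a single weight adjustment looks like, here is a minimal Python sketch. It assumes the simplest possible case, a single sigmoid output node trained by gradient descent on squared error, and every number in it (weights, inputs, target, learning rate) is hypothetical. The worked example in the book may use a different network and different values.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, bias, inputs):
    """Output of a single sigmoid node for one record."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def update_step(weights, bias, inputs, target, learning_rate=0.1):
    """One gradient-descent adjustment of the node's weights for one record."""
    output = forward(weights, bias, inputs)
    # Error term for a sigmoid output node under squared error.
    delta = output * (1.0 - output) * (target - output)
    new_weights = [w + learning_rate * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias + learning_rate * delta
    return new_weights, new_bias, output

# Hypothetical starting point: two inputs, one record, target value 1.0.
weights, bias = [0.5, -0.3], 0.1
inputs, target = [1.0, 2.0], 1.0

new_weights, new_bias, output_before = update_step(weights, bias, inputs, target)
output_after = forward(new_weights, new_bias, inputs)
```

After the single step, the node's output for this record moves slightly toward the target. Repeating the same adjustment over many records and many passes is what the "learning" amounts to.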
Another frequently used analysis technique in data mining is clustering. The author devotes a chapter
each to the two most widely employed methodologies: k-means clustering and Kohonen networks. Both
are explained clearly and examples are given. After the clusters are generated, the author describes how
to interpret and use the output. The chapters are valuable and should convince the students that this is
a tool that should almost always be applied as part of any data mining project. A criticism of both chapters
was that the author did not provide enough analysis and interpretation of the clusters. The analysis that
was provided was quite good, but it was too brief. Adding a few more pages of discussion of the clusters
would have been useful.
Finally, the last chapter on modeling dealt with rule generation. It was pleasantly surprising to see a
chapter on this subject, because this is an introductory text and the concept can be difficult. However, it
is an important technique, particularly for market basket analysis, so it was good that it was included.
As with the other chapters before it, the author does not make the mistake of going into too much depth.
Instead, he focuses on presenting the key concepts and presenting enough information that a student can
develop models competently and explain the results.
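For readers unfamiliar with rule generation in market basket analysis, the two standard measures used to judge a rule such as "bread implies milk" are support and confidence. The Python sketch below computes both for a handful of made-up baskets. The function names and the data are hypothetical, and the sketch only scores a given rule rather than searching for rules.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies in this data
    both = support(transactions, set(antecedent) | set(consequent))
    return both / base

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

Here bread and milk appear together in two of the four baskets, and two of the three baskets that contain bread also contain milk.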
In the last chapter of the book, the author gives an overview of standard methods for evaluating math-
ematical models. The author was successful in presenting these ideas, and they should be easily understood.
Furthermore, the author presents the concept of cost-benefit analysis. Many times, a person new to data
mining will make the mistake of thinking that the most accurate model is the best while completely ignoring
the fact that a less accurate model might actually generate more profit. So, along with the standard lift
charts, gains charts, and confusion matrices, the author presents the concept of calculating the costs of a
"false positive" and a "false negative" and the benefits of a correct answer.
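The point that a less accurate model can be the more profitable one is easy to check with a small calculation. The Python sketch below compares two made-up confusion matrices for a churn model. All counts, costs, and benefits are hypothetical and chosen only to illustrate the idea, and the simple profit formula is an assumption rather than the book's own.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all records that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)

def profit(tp, fp, fn, benefit_tp, cost_fp, cost_fn):
    """Net value: gain on each true positive minus the cost of each kind of error."""
    return tp * benefit_tp - fp * cost_fp - fn * cost_fn

# Hypothetical results on 1000 customers (positive = customer will churn).
model_a = {"tp": 50, "fp": 10, "fn": 50, "tn": 890}   # cautious model
model_b = {"tp": 90, "fp": 100, "fn": 10, "tn": 800}  # aggressive model

# Hypothetical economics: retaining a churner is worth 100, an unneeded
# retention offer costs 10, and a missed churner costs 100.
economics = {"benefit_tp": 100, "cost_fp": 10, "cost_fn": 100}

acc_a = accuracy(**model_a)  # 0.94
acc_b = accuracy(**model_b)  # 0.89
profit_a = profit(model_a["tp"], model_a["fp"], model_a["fn"], **economics)  # -100
profit_b = profit(model_b["tp"], model_b["fp"], model_b["fn"], **economics)  # 7000
```

With these numbers the second model is less accurate (0.89 against 0.94) but far more profitable, because a missed churner costs much more than an unnecessary offer.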