Data Mining Report (Final)
A Minor Project
in
DATA MINING

MASTER OF TECHNOLOGY
IN
COMPUTER ENGINEERING

Under the Guidance of                                        Submitted by
Table of Contents
Chapter 1 Introduction
Chapter 2 Literature Survey
Chapter 3 Tools/Technologies to be used
Chapter 4 Various Data Mining Techniques
Chapter 5 Applications of Data Mining
Future Scope and Work
List of References
List of figures
Figure 1.1 Histogram shows the number of customers with various eye colors
Figure 1.2 Histogram shows the number of customers of different ages and quickly tells
the viewer that the majority of customers are over the age of 50.
Figure 1.3 Linear regression is similar to the task of finding the line that minimizes the
total distance to a set of data.
Figure 2.1 A decision tree is a predictive model that makes a prediction on the basis of a
series of decisions, much like the game of 20 questions.
Figure 2.2 Neural networks can be used for data compression and feature extraction
List of Tables
Table 2.1 Decision tree algorithm segment. This segment cannot be split further except
by using the predictor "name".
ACKNOWLEDGEMENT
A journey is easier when you travel together. Interdependence is certainly more valuable
than independence. It is a pleasure that I now have the opportunity to express my
gratitude to all of them.
I would like to express my deep and sincere gratitude to my supervisor Er. Gianetan
Singh Sekhon, Lecturer, Computer Science & Engineering Department, YCoE, Talwandi
Sabo. His wide knowledge and logical way of thinking have been of great value to me.
His understanding, encouragement and personal guidance, even beyond duty hours, have
provided a good basis for the present project work. I would have been lost without him.
(Hans Raj)
Chapter 1
Introduction
Data mining is primarily used today by companies with a strong consumer focus - retail,
financial, communication, and marketing organizations. It enables these companies to determine
relationships among "internal" factors such as price, product positioning, or staff skills, and
"external" factors such as economic indicators, competition, and customer demographics. And, it
enables them to determine the impact on sales, customer satisfaction, and corporate profits.
Finally, it enables them to "drill down" into summary information to view detail transactional
data. With data mining, a retailer could use point-of-sale records of customer purchases to send
targeted promotions based on an individual's purchase history. By mining demographic data
from comment or warranty cards, the retailer could develop products and promotions to appeal
to specific customer segments. Data mining is the process of extracting patterns from data. As
more data are gathered, with the amount of data doubling every three years [1], data mining is
becoming an increasingly important tool for transforming these data into information. It is
commonly used in a wide range of profiling practices, such as marketing, surveillance, fraud
detection and scientific discovery. While data mining can be used to uncover patterns in data
samples, it is important to be aware that the use of non-representative samples of data may
produce results that are not indicative of the domain. Similarly, data mining will not find
patterns that may be present in the domain, if those patterns are not present in the sample being
"mined". There is a tendency for insufficiently knowledgeable "consumers" of the results to
attribute "magical abilities" to data mining, treating the technique as a sort of all-seeing crystal
ball. Like any other tool, it only functions in conjunction with the appropriate raw material: in
this case, indicative and representative data that the user must first collect. Further, the discovery
of a particular pattern in a particular set of data does not necessarily mean that pattern is
representative of the whole population from which that data was drawn. Hence, an important
part of the process is the verification and validation of patterns on other samples of data.
Chapter 2
Literature Survey
Data mining, the extraction of hidden predictive information from large databases, is a
powerful new technology with great potential to help companies focus on the most
important information in their data warehouses. Data mining tools predict future trends and
behaviors, allowing businesses to make proactive, knowledge-driven decisions. The
automated, prospective analyses offered by data mining move beyond the analyses of past
events provided by retrospective tools typical of decision support systems. Data mining
tools can answer business questions that traditionally were too time consuming to resolve.
They scour databases for hidden patterns, finding predictive information that experts may
miss because it lies outside their expectations. Most companies already collect and refine
massive quantities of data. Data mining techniques can be implemented rapidly on existing
software and hardware platforms to enhance the value of existing information resources,
and can be integrated with new products and systems as they are brought on-line. When
implemented on high performance client/server or parallel processing computers, data
mining tools can analyze massive databases to deliver answers to questions such as,
"Which clients are most likely to respond to my next promotional mailing, and why?"
Chapter 3
Tools/Technologies to be used
1. Windows XP
2. .NET Framework 2.0
3. Crystal Reports
4. MySQL Server
5. VB.NET
CHAPTER 4
VARIOUS DATA MINING TECHNIQUES
1.2 DATA MINING TECHNIQUES
This chapter describes some of the most common data mining algorithms in use today.
Each section will describe a number of data mining algorithms at a high level, focusing on
the "big picture" so that the reader will be able to understand how each algorithm fits into
the landscape of data mining techniques. Overall, six broad classes of data mining
algorithms are covered. Although there are a number of other algorithms and many
variations of the techniques described, one of the algorithms from this group of six is
almost always used in real world deployments of data mining systems.
1.2.1.2 Statistics
By strict definition "statistics" or statistical techniques are not data mining. They were
being used long before the term data mining was coined to apply to business applications.
However, statistical techniques are driven by the data and are used to discover patterns and
build predictive models. And from the user's perspective you will be faced with a
conscious choice when solving a "data mining" problem as to whether you wish to attack it
with statistical methods or other data mining techniques. For this reason it is important to
have some idea of how statistical techniques work and how they can be applied.
Difference between statistics and data mining?
I flew the Boston to Newark shuttle recently and sat next to a professor from one of the
Boston area universities. He was going to present the drosophila (fruit fly) genetic
makeup to a pharmaceutical company in New Jersey. He had compiled the world's largest
database on the genetic makeup of the fruit fly and had made it available to other
researchers on the internet through Java applications accessing a large relational database.
He explained to me that they were now not only storing the information on the flies but
also doing "data mining", adding as an aside, "which seems to be very important these
days, whatever that is". I mentioned that I had written a book on the subject and he was
interested in knowing what the difference was between "data mining" and statistics. There
was no easy answer. The techniques used in data mining, when successful, are successful
for precisely the same reasons that statistical techniques are successful (e.g. clean data, a
well defined target to predict and good validation to avoid overfitting). And for the most
part the techniques are used in the same places for the same types of problems (prediction,
classification, discovery). In fact some of the techniques that are classically defined as "data
mining", such as CART and CHAID, arose from statisticians.
So what is the difference? Why aren't we as excited about "statistics" as we are about data
mining? There are several reasons. The first is that the classical data mining techniques
such as CART, neural networks and nearest neighbor techniques tend to be more robust to
both messier real world data and also more robust to being used by less expert users. But
that is not the only reason. The other reason is that the time is right. Because of the use of
computers for closed loop business data storage and generation, there now exist large
quantities of data available to users. If there were no data, there would be no
interest in mining it. Likewise, the fact that computer hardware has dramatically upped the
ante by several orders of magnitude in storing and processing the data makes some of the
most powerful data mining techniques feasible today.
The bottom line though, from an academic standpoint at least, is that there is little practical
difference between a statistical technique and a classical data mining technique. Hence we
have included a description of some of the most useful in this section.
What is statistics?
Statistics is a branch of mathematics concerning the collection and the description of data.
Usually statistics is considered to be one of those scary topics in college right up there with
chemistry and physics. However, statistics is probably a much friendlier branch of
mathematics because it really can be used every day. Statistics was in fact born from very
humble beginnings of real world problems from business, biology, and gambling!
Knowing statistics in your everyday life will help the average business person make better
decisions by allowing them to figure out risk and uncertainty when all the facts either
aren’t known or can’t be collected. Even with all the data stored in the largest of data
warehouses, business decisions are still just informed guesses. The more and
better the data, and the better the understanding of statistics, the better the decision that can
be made.
Statistics has been around for a long time, easily a century, and arguably many centuries
when the ideas of probability began to gel. It could even be argued that the data collected
by the ancient Egyptians, Babylonians, and Greeks were all statistics long before the field
was officially recognized. Today data mining has been defined independently of statistics
though “mining data” for patterns and predictions is really what statistics is all about.
Some of the techniques that are classified under data mining such as CHAID and CART
really grew out of the statistical profession more than anywhere else, and the basic ideas of
probability, independence and causality and overfitting are the foundation on which both
data mining and statistics are built.
This is certainly more true today than it was when the basic ideas of probability
and statistics were being formulated and refined early in the twentieth century. Today people have to
deal with up to terabytes of data and have to make sense of it and glean the important
patterns from it. Statistics can help greatly in this process by helping to answer several
important questions about your data.
Certainly statistics can do more than answer these questions but for most people today
these are the questions that statistics can help answer. Consider for example that a large
part of statistics is concerned with summarizing data, and more often than not, this
summarization has to do with counting. One of the great values of statistics is in
presenting a high level view of the database that provides some useful information without
requiring every record to be understood in detail. This aspect of statistics is the part that
people run into every day when they read the daily newspaper and see, for example, a pie
chart reporting the number of US citizens of different eye colors, or the average number of
annual doctor visits for people of different ages. Statistics at this level is used in the
reporting of important information from which people may be able to make useful
decisions. There are many different parts of statistics but the idea of collecting data and
counting it is often at the base of even these more sophisticated techniques. The first step
then in understanding statistics is to understand how the data is collected into a higher
level form - one of the most notable ways of doing this is with the histogram.
Histograms
One of the best ways to summarize data is to provide a histogram of the data. In the
simple example database shown in Table 1.1 we can create a histogram of eye color by
counting the number of occurrences of different colors of eyes in our database. For this
example database of 10 records this is fairly easy to do and the results are only slightly
more interesting than the database itself. However, for a database of many more records
this is a very useful way of getting a high level understanding of the database.
ID Name Prediction Age Balance Income Eyes Gender
1 Amy No 62 $0 Medium Brown F
2 Al No 53 $1,800 Medium Green M
3 Betty No 47 $16,543 High Brown F
4 Bob Yes 32 $45 Medium Green M
5 Carla Yes 21 $2,300 High Blue F
6 Carl No 27 $5,400 High Brown M
7 Donna Yes 50 $165 Low Blue F
8 Don Yes 46 $0 High Blue M
9 Edna Yes 27 $500 Low Blue F
10 Ed No 68 $1,200 Low Blue M
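As a minimal sketch, the eye-color counts behind the histogram in Figure 1.1 can be produced by simply counting occurrences in the 10-record example database above (Python; the printout format is only illustrative):

from collections import Counter

# Eye colors copied from the 10-record example database above.
eyes = ["Brown", "Green", "Brown", "Green", "Blue",
        "Brown", "Blue", "Blue", "Blue", "Blue"]

histogram = Counter(eyes)
print(histogram)                        # Counter({'Blue': 5, 'Brown': 3, 'Green': 2})
for color, count in histogram.most_common():
    print(f"{color:<6}", "#" * count)   # a crude text version of figure 1.1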
The histogram shown in figure 1.1 depicts a simple predictor (eye color) which will have
only a few different values no matter whether there are 100 customer records in the database or
100 million. There are, however, other predictors that have many more distinct values and
can create a much more complex histogram. Consider, for instance, the histogram of ages
of the customers in the population. In this case the histogram can be more complex but
can also be enlightening. Consider if you found that the histogram of your customer data
looked as it does in figure 1.2.
Figure 1.1 This histogram shows the number of customers with various eye colors. This
summary can quickly show important information about the database such as that blue
eyes are the most frequent.
Figure 1.2 This histogram shows the number of customers of different ages and quickly
tells the viewer that the majority of customers are over the age of 50.
By looking at this second histogram the viewer is in many ways looking at all of the data
in the database for a particular predictor or data column. By looking at this histogram it is
also possible to build an intuition about other important factors, such as the average age of
the population and the maximum and minimum age, all of which are important. These
values are called summary statistics. Some of the most frequently used summary statistics
include the maximum, the minimum, the mean and the variance.
When there are many values for a given predictor the histogram begins to look smoother
and smoother (compare the difference between the two histograms above). Sometimes the
shape of the distribution of data can be calculated by an equation rather than just
represented by the histogram. This is what is called a data distribution. Like a histogram a
data distribution can be described by a variety of statistics. In classical statistics the belief
is that there is some “true” underlying shape to the data distribution that would be formed
if all possible data was collected. The shape of the data distribution can be calculated for
some simple examples. The statistician’s job then is to take the limited data that may have
been collected and from that make their best guess at what the “true” or at least most likely
underlying data distribution might be.
Many data distributions are well described by just two numbers, the mean and the
variance. The mean is something most people are familiar with, the variance, however,
can be problematic. The easiest way to think about it is that it measures the average
distance of each predictor value from the mean value over all the records in the database.
If the variance is high it implies that the values are all over the place and very different. If
the variance is low most of the data values are fairly close to the mean. To be precise the
actual definition of the variance uses the square of the distance rather than the actual
distance from the mean and the average is taken by dividing the squared sum by one less
than the total number of records. In terms of prediction a user could make some guess at
the value of a predictor without knowing anything else just by knowing the mean and also
gain some basic sense of how variable the guess might be based on the variance.
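As a minimal sketch, the mean and variance of the balance column from the example database can be computed directly from the definitions above (note the division by one less than the number of records):

# Balances copied from the 10-record example database above.
balances = [0, 1800, 16543, 45, 2300, 5400, 165, 0, 500, 1200]

n = len(balances)
mean = sum(balances) / n
# Average squared distance from the mean, divided by one less than the
# total number of records, as described above.
variance = sum((b - mean) ** 2 for b in balances) / (n - 1)

print(mean)               # 2795.3
print(variance)
print(variance ** 0.5)    # standard deviation, in the same units as balance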
Linear regression
In statistics prediction is usually synonymous with regression of some form. There are a
variety of different types of regression in statistics but the basic idea is that a model is
created that maps values from predictors in such a way that the lowest error occurs in
making a prediction. The simplest form of regression is simple linear regression that just
contains one predictor and a prediction. The relationship between the two can be mapped
on a two dimensional space and the records plotted for the prediction values along the Y
axis and the predictor values along the X axis. The simple linear regression model then
could be viewed as the line that minimizes the error between the actual prediction
value and the point on the line (the prediction from the model). Graphically this would
look as it does in Figure 1.3. The simplest form of regression seeks to build a predictive
model that is a line that maps between each predictor value to a prediction value. Of the
many possible lines that could be drawn through the data the one that minimizes the
distance between the line and the data points is the one that is chosen for the predictive
model.
On average if you guess the value on the line it should represent an acceptable compromise
amongst all the data at that point giving conflicting answers. Likewise if there is no data
available for a particular input value the line will provide the best guess at a reasonable
answer based on similar data.
Figure 1.3 Linear regression is similar to the task of finding the line that minimizes the
total distance to a set of data.
The predictive model is the line shown in Figure 1.3. The line will take a given value for a
predictor and map it into a given value for a prediction. The actual equation would look
something like Prediction = a + b * Predictor, which is just the equation for a line, Y = a
+ bX. As an example, for a bank the predicted average consumer bank balance might equal
$1,000 + 0.01 * customer’s annual income. The trick, as always with predictive modeling,
is to find the model that best minimizes the error. The most common way to calculate the
error is the square of the difference between the predicted value and the actual value.
Calculated this way points that are very far from the line will have a great effect on moving
the choice of line towards themselves in order to reduce the error. The values of a and b in
the regression equation that minimize this error can be calculated directly from the data
relatively quickly.
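A minimal sketch of that direct calculation, using made-up income and balance figures, computes the least-squares estimates of a and b in closed form:

# Made-up example data: predictor (annual income) and prediction (balance).
incomes  = [20_000, 35_000, 48_000, 60_000, 75_000, 90_000]
balances = [400, 900, 1_300, 1_700, 2_000, 2_600]

n = len(incomes)
mean_x = sum(incomes) / n
mean_y = sum(balances) / n

# Least-squares estimates that minimize the squared error between the line
# and the data points.
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(incomes, balances))
     / sum((x - mean_x) ** 2 for x in incomes))
a = mean_y - b * mean_x

print(a, b)                    # intercept and slope of the fitted line
print(a + b * 55_000)          # predicted balance for a $55,000 income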
A simple example of nearest neighbor
A simple example of the nearest neighbor prediction algorithm is to look at the
people in your neighborhood (in this case those people that are in fact geographically near
to you). You may notice that, in general, you all have somewhat similar incomes. Thus if
your neighbor has an income greater than $100,000, chances are good that you too have a
high income. Certainly the chances that you have a high income are greater when all of
your neighbors have incomes over $100,000 than if all of your neighbors have incomes of
$20,000. Within your neighborhood there may still be a wide variety of incomes possible
among even your “closest” neighbors, but if you had to predict someone’s income based
only on knowing their neighbors, your best chance of being right would be to predict the
incomes of the neighbors who live closest to the unknown person.
The nearest neighbor prediction algorithm works in very much the same way except that
“nearness” in a database may consist of a variety of factors not just where the person
lives. It may, for instance, be far more important to know which school someone attended
and what degree they attained when predicting income. The better definition of “near”
might in fact be other people that you graduated from college with rather than the people
that you live next to.
Nearest Neighbor techniques are among the easiest to use and understand because they
work in a way similar to the way that people think - by detecting closely matching
examples. They also perform quite well in terms of automation, as many of the algorithms
are robust with respect to dirty data and missing data. Lastly they are particularly adept at
performing complex ROI calculations because the predictions are made at a local level
where business simulations could be performed in order to optimize ROI. As they enjoy
similar levels of accuracy compared to other techniques, measures of accuracy such as
lift are as good as those from any other technique.
orange than it is to a tomato and that a Toyota Corolla is closer to a Honda Civic than to a
Porsche. This sense of ordering on many different objects helps us place them in time and
space and to make sense of the world. It is what allows us to build clusters - both in
databases on computers as well as in our daily lives. This definition of nearness that seems
to be ubiquitous also allows us to make predictions.
The nearest neighbor prediction algorithm, simply stated, is:
Objects that are “near” to each other will have similar prediction values as well. Thus if
you know the prediction value of one of the objects you can predict it for its nearest
neighbors.
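A minimal sketch of this idea, with made-up records: a person's income is predicted as the average income of the k records nearest to them, where "near" here is just Euclidean distance over two illustrative predictors (a real system would also learn how to weight the predictors):

import math

# Made-up historical records: (years_of_education, age, income).
records = [
    (12, 45, 38_000),
    (16, 41, 72_000),
    (18, 39, 95_000),
    (11, 52, 29_000),
    (16, 35, 68_000),
]

def predict_income(education, age, k=3):
    # Distance in predictor space; both predictors weighted equally here.
    def distance(rec):
        return math.hypot(rec[0] - education, rec[1] - age)
    nearest = sorted(records, key=distance)[:k]
    # The prediction value is borrowed from the nearest neighbors.
    return sum(rec[2] for rec in nearest) / k

print(predict_income(education=16, age=40))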
The way that this problem is solved for both nearest neighbor techniques and for some
other types of prediction algorithms is to create training records by taking, for instance, 10
consecutive stock prices and using the first 9 as predictor values and the 10th as the
prediction value. Doing things this way, if you had 100 data points in your time series you
could create 10 different training records.
You could create even more training records than 10 by creating a new record starting at
every data point. For instance, you could take the first 10 data points and create a
record. Then you could take the 10 consecutive data points starting at the second data
point, then the 10 consecutive data points starting at the third data point. Even though
some of the data points would overlap from one record to the next, the prediction value
would always be different. In our example of 100 initial data points, 91 different training
records could be created this way (one starting at each point that still has 9 points after it),
as opposed to the 10 training records created via the other method.
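A minimal sketch of this sliding-window construction, using a stand-in series of 100 values: each record uses 9 consecutive values as predictors and the 10th as the prediction value:

# Stand-in for 100 consecutive stock prices.
prices = list(range(100))

window = 10
training_records = [
    (prices[i:i + window - 1],      # first 9 points are the predictors
     prices[i + window - 1])        # the 10th point is the prediction value
    for i in range(len(prices) - window + 1)
]

print(len(training_records))        # 91 overlapping records
print(training_records[0])          # ([0, 1, 2, 3, 4, 5, 6, 7, 8], 9)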
1.2.1.4. Clustering
Name                  Income        Age                   Education        Vendor
Living Off the Land   Middle-Poor   School Age Families   Low              Equifax MicroVision™
University USA        Very low      Young - Mix           Medium to High   Equifax MicroVision™
Sunset Years          Medium        Seniors               Medium           Equifax MicroVision™
This clustering information is then used by the end user to tag the customers in their
database. Once this is done the business user can get a quick high level view of what is
happening within the cluster. Once the business user has worked with these codes for some
time they also begin to build intuitions about how these different customer clusters will
react to the marketing offers particular to their business. For instance, some of these
clusters may relate to their business and some of them may not. But given that their
competition may well be using these same clusters to structure their business and
marketing offers, it is important to be aware of how your customer base behaves in regard to
these clusters.
How is clustering like the nearest neighbor technique?
The nearest neighbor algorithm is basically a refinement of clustering in the sense that they
both use distance in some feature space to create either structure in the data or predictions.
The nearest neighbor algorithm is a refinement since part of the algorithm usually is a way
of automatically determining the weighting of the importance of the predictors and how the
distance will be measured within the feature space. Clustering is one special case of this
where the importance of each predictor is considered to be equivalent.
If these were your friends rather than your customers (hopefully they could be both) and
they were single, you might cluster them based on their compatibility with each other,
creating your own mini dating service. If you were a pragmatic person you might cluster
your database on the assumption that marital happiness is mostly dependent on
financial compatibility, and create three clusters as shown in Table 1.4.
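A minimal k-means sketch (k = 3) in that spirit, assuming NumPy is available: it clusters the example customers on a single "financial" predictor, the account balance, with every predictor weighted equally, as clustering does; the initialization and data are only illustrative:

import numpy as np

# Balances from the 10-record example database above.
balances = np.array([0, 1800, 16543, 45, 2300, 5400, 165, 0, 500, 1200], dtype=float)
X = balances.reshape(-1, 1)

k = 3
rng = np.random.default_rng(0)
centers = X[rng.choice(len(X), size=k, replace=False)]   # pick 3 records as starting centers

for _ in range(20):
    # Assign each record to its nearest cluster center ...
    labels = np.argmin(np.abs(X - centers.T), axis=1)
    # ... then move each center to the mean of the records assigned to it,
    # leaving a center where it is if no records are assigned to it.
    centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                        for j in range(k)])

print(labels)            # cluster id for each of the 10 customers
print(centers.ravel())   # the three "financial compatibility" cluster centers

A real deployment would of course cluster on several predictors at once, normalized to comparable ranges so that no single column dominates the distance.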
Figure 2.1 A decision tree is a predictive model that makes a prediction on the basis of a
series of decisions, much like the game of 20 questions.
• It divides up the data on each branch point without losing any of the data (the
number of total records in a given parent node is equal to the sum of the records
contained in its two children).
• The number of churners and non-churners is conserved as you move up or down
the tree.
• It is pretty easy to understand how the model is being built (in contrast to the
models from neural networks or from standard statistics).
• It would also be pretty easy to use this model if you actually had to target those
customers that are likely to churn with a targeted marketing offer.
You may also build some intuitions about your customer base, e.g. “customers who have
been with you for a couple of years and have up-to-date cellular phones are pretty loyal”.
Viewing decision trees as segmentation with a purpose
From a business perspective decision trees can be viewed as creating a segmentation of the
original dataset (each segment would be one of the leaves of the tree). Segmentation of
customers, products, and sales regions is something that marketing managers have been
doing for many years. In the past this segmentation has been performed in order to get a
high level view of a large amount of data - with no particular reason for creating the
segmentation except that the records within each segmentation were somewhat similar to
each other.
In this case the segmentation is done for a particular reason - namely for the prediction of
some important piece of information. The records that fall within each segment fall there
because they have similarity with respect to the information being predicted - not just that
they are similar, without similarity being well defined. These predictive segments that are
derived from the decision tree also come with a description of the characteristics that
define the predictive segment. Thus, although the decision trees and the algorithms that create them
may be complex, the results can be presented in an easy-to-understand way that can be
quite useful to the business user.
Where can decision trees be used?
Decision trees are a data mining technology that has been around, in a form very similar to
the technology of today, for almost twenty years, and early versions of the algorithms
date back to the 1960s. Oftentimes these techniques were originally developed for
statisticians to automate the process of determining which fields in their database were
actually useful or correlated with the particular problem that they were trying to
understand. Partially because of this history, decision tree algorithms tend to automate the
entire process of hypothesis generation and then validation much more completely and in a
much more integrated way than any other data mining technique. They are also
particularly adept at handling raw data with little or no pre-processing. Perhaps also
because they were originally developed to mimic the way an analyst interactively performs
data mining, they provide a simple-to-understand predictive model based on rules (such as
“90% of the time, credit card customers of less than 3 months who max out their credit
limit are going to default on their credit card loan.”).
Because decision trees score so highly on so many of the critical features of data mining
they can be used in a wide variety of business problems for both exploration and for
prediction. They have been used for problems ranging from credit card attrition prediction
to time series prediction of the exchange rate of different international currencies. There
are also some problems where decision trees will not do as well. Some very simple
problems where the prediction is just a simple multiple of the predictor can be solved much
more quickly and easily by linear regression. Usually the models to be built and the
interactions to be detected are much more complex in real world problems and this is
where decision trees excel.
“IF customer lifetime < 1.1 years AND sales channel = telesales THEN chance of churn is
65%.”
other. Thus the question: “Are you over 40?” probably does not sufficiently distinguish
between those who are churners and those who are not - let’s say it is 40%/60%. On the
other hand there may be a series of questions that do quite a nice job in distinguishing
those cellular phone customers who will churn and those who won’t. Maybe the series of
questions would be something like: “Have you been a customer for less than a year, do you
have a telephone that is more than two years old and were you originally landed as a
customer via telesales rather than direct sales?” This series of questions defines a segment
of the customer population in which 90% churn. These are then relevant questions to be
asking in relation to predicting churn. The process in decision tree algorithms is very
similar when they build trees. These algorithms look at all possible distinguishing
questions that could possibly break up the original training dataset into segments that are
nearly homogeneous with respect to the different classes being predicted. Some decision
tree algorithms may use heuristics in order to pick the questions or even pick them at
random. CART picks the questions in a very unsophisticated way: it tries them all. After
it has tried them all, CART picks the best one, uses it to split the data into two more
organized segments, and then again asks all possible questions on each of those new
segments individually.
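A minimal sketch of this try-them-all split search, scoring each candidate question by the weighted Gini impurity of the two segments it creates; the churn records and the simple equality questions are made up for illustration, and real CART also considers threshold questions on numeric predictors:

from collections import Counter

def gini(labels):
    # Gini impurity of a segment: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(records, labels):
    # Try every predictor/value question and keep the one whose two segments
    # are purest (lowest weighted impurity).
    best = None
    for col in records[0]:
        for value in {r[col] for r in records}:
            left = [lab for r, lab in zip(records, labels) if r[col] == value]
            right = [lab for r, lab in zip(records, labels) if r[col] != value]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, col, value)
    return best

# Made-up cellular churn training records.
records = [
    {"tenure_years": 0.5, "channel": "telesales"},
    {"tenure_years": 0.8, "channel": "telesales"},
    {"tenure_years": 3.0, "channel": "direct"},
    {"tenure_years": 2.5, "channel": "direct"},
]
labels = ["churn", "churn", "stay", "stay"]
print(best_split(records, labels))   # the "channel" question separates the classes perfectly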
If the decision tree algorithm just continued growing the tree like this it could conceivably
create more and more questions and branches in the tree so that eventually there was only
one record in each segment. To let the tree grow to this size is both computationally
expensive and unnecessary. Most decision tree algorithms stop growing the tree when
one of three criteria is met:
• The segment contains only one record. (There is no further question that you could
ask which could further refine a segment of just one.)
• All the records in the segment have identical characteristics. (There is no reason to
continue asking further questions since all the remaining records are
the same.)
• The improvement is not substantial enough to warrant making the split.
Consider the following example shown in Table 2.1 of a segment that we might want to
split further which has just two examples. It has been created out of a much larger
customer database by selecting only those customers aged 27 with blue eyes and salaries
between $80,000 and $81,000.
Name Age Eyes Salary Churned?
Steve 27 Blue $80,000 Yes
Alex 27 Blue $80,000 No
Table 2.1 Decision tree algorithm segment. This segment cannot be split further except
by using the predictor "name".
In this case all of the possible questions that could be asked about the two customers turn
out to have the same value (age, eyes, salary) except for name. It would then be possible
to ask a question like “Is the customer’s name Steve?” and create segments which
would be very good at breaking apart those who churned from those who did not.
The problem is that we all have an intuition that the name of the customer is not going to
be a very good indicator of whether that customer churns or not. It might work well for
this particular 2 record segment but it is unlikely that it will work for other customer
databases or even the same customer database at a different time. This particular example
has to do with overfitting the model - in this case fitting the model too closely to the
idiosyncrasies of the training data. This can be fixed later on but clearly stopping the
building of the tree short of either one record segments or very small segments in general is
a good idea.
After the tree has been grown to a certain size (depending on the particular stopping
criteria used in the algorithm) the CART algorithm has still more work to do. The
algorithm then checks to see if the model has been overfit to the data. It does this in
several ways, using a cross-validation approach or a test-set validation approach, basically
the same mind-numbingly simple approach it used to find the best questions in the
first place: trying many different simpler versions of the tree on a held-aside test
set. The tree that does the best on the held-aside data is selected by the algorithm as the
best model. The nice thing about CART is that this testing and selection is all an integral
part of the algorithm, as opposed to the after-the-fact approach that other techniques use.
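A minimal sketch of this grow-then-prune-by-held-aside-data idea (not CART itself), assuming scikit-learn is available; the dataset is synthetic and the parameter choices are illustrative:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a customer database with a churn-style target.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow a large tree, then enumerate progressively simpler pruned versions of it.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Keep the version that does best on the held-aside test set.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in alphas),
    key=lambda tree: tree.score(X_test, y_test),
)
print(best.get_n_leaves(), best.score(X_test, y_test))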
1.2.2.3. Neural Networks
When data mining algorithms are talked about these days, most of the time people are
talking about either decision trees or neural networks. Of the two, neural networks have
probably been of greater interest through the formative stages of data mining technology.
As we will see neural networks do have disadvantages that can be limiting in their ease of
use and ease of deployment, but they do also have some significant advantages. Foremost
among these advantages is their highly accurate predictive models that can be applied
across a large number of different types of problems.
To be more precise with the term “neural network”, one might better speak of an “artificial
neural network”. True neural networks are biological systems (a.k.a. brains) that detect
patterns, make predictions and learn. The artificial ones are computer programs
implementing sophisticated pattern detection and machine learning algorithms on a
computer to build predictive models from large historical databases. Artificial neural
networks derive their name from their historical development which started off with the
premise that machines could be made to “think” if scientists found ways to mimic the
structure and functioning of the human brain on the computer. Thus historically neural
networks grew out of the community of Artificial Intelligence rather than from the
discipline of statistics. Despite the fact that scientists are still far from understanding the
human brain let alone mimicking it, neural networks that run on computers can do some of
the things that people can do.
It is difficult to say exactly when the first “neural network” on a computer was built.
During World War II a seminal paper was published by McCulloch and Pitts which first
outlined the idea that simple processing units (like the individual neurons in the human
brain) could be connected together in large networks to create a system that could solve
difficult problems and display behavior that was much more complex than the simple
pieces that made it up. Since that time much progress has been made in finding ways to
apply artificial neural networks to real world prediction problems and in improving the
performance of the algorithm in general. In many respects the greatest breakthroughs in
neural networks in recent years have been in their application to more mundane real world
problems like customer response prediction or fraud detection rather than the loftier goals
that were originally set out for the techniques such as overall human learning and computer
speech and image understanding.
Because of the origins of the techniques and because of some of their early successes the
techniques have enjoyed a great deal of interest. To understand how neural networks can
detect patterns in a database an analogy is often made that they “learn” to detect these
patterns and make better predictions in a similar way to the way that human beings do.
This view is encouraged by the way the historical training data is often supplied to the
network - one record (example) at a time. Neural networks do “learn” in a very real sense
but under the hood the algorithms and techniques that are being deployed are not truly
different from the techniques found in statistics or other data mining algorithms. It is, for
instance, unfair to assume that neural networks could outperform other techniques because
they “learn” and improve over time while the other techniques are static. The other
techniques in fact “learn” from historical examples in exactly the same way, but oftentimes
the examples (historical records) to learn from are processed all at once in a more efficient
manner than in neural networks, which often modify their model one record at a time.
A common claim for neural networks is that they are automated to a degree where the user
does not need to know that much about how they work, or predictive modeling or even the
database in order to use them. The implicit claim is also that most neural networks can be
unleashed on your data straight out of the box without having to rearrange or modify the
data very much to begin with.
Just the opposite is often true. There are many important design decisions that need to be
made in order to effectively use a neural network such as:
There are also many important steps required for preprocessing the data that goes into a
neural network - most often there is a requirement to normalize numeric data between 0.0
and 1.0 and categorical predictors may need to be broken up into virtual predictors that are
0 or 1 for each value of the original categorical predictor. And, as always, understanding
what the data in your database means and a clear definition of the business problem to be
solved are essential to ensuring eventual success. The bottom line is that neural networks
provide no short cuts.
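A minimal sketch of the two preprocessing steps just described, normalizing a numeric predictor into the 0.0-1.0 range and breaking a categorical predictor into 0/1 virtual predictors; the column values are taken from the example database and the variable names are illustrative:

# Ages and income categories from the 10-record example database.
ages = [62, 53, 47, 32, 21, 27, 50, 46, 27, 68]
incomes = ["Medium", "Medium", "High", "Medium", "High",
           "High", "Low", "High", "Low", "Low"]

# Min-max scaling: map the numeric predictor into the 0.0-1.0 range.
lo, hi = min(ages), max(ages)
ages_scaled = [(a - lo) / (hi - lo) for a in ages]

# One 0/1 "virtual" predictor per value of the categorical predictor.
levels = sorted(set(incomes))                 # ['High', 'Low', 'Medium']
income_one_hot = [[1 if v == level else 0 for level in levels] for v in incomes]

print(ages_scaled[:3])
print(levels)
print(income_one_hot[:3])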
The first tactic has seemed to work quite well because when the technique is used for a
well defined problem many of the difficulties in preprocessing the data can be automated
(because the data structures have been seen before) and interpretation of the model is less
of an issue since entire industries begin to use the technology successfully and a level of
trust is created. There are several vendors who have deployed this strategy (e.g. HNC’s
Falcon system for credit card fraud prediction and Advanced Software Applications’
ModelMAX package for direct marketing).
Packaging up neural networks with expert consultants is also a viable strategy that avoids
many of the pitfalls of using neural networks, but it can be quite expensive because it is
human intensive. One of the great promises of data mining is, after all, the automation of
the predictive modeling process. These neural network consulting teams are little different
from the analytical departments many companies already have in house. Since there is not
a great difference in the overall predictive accuracy of neural networks over standard
statistical techniques the main difference becomes the replacement of the statistical expert
with the neural network expert. Either with statistics or neural network experts the value of
putting easy to use tools into the hands of the business end user is still not achieved.
toward creating clusters that compete against each other for the records that they contain,
thus ensuring that the clusters overlap as little as possible.
at a higher level of features such as trees, mountains etc. In either case your friend
eventually gets all the information that they need in order to know what the picture looks
like, but certainly describing it in terms of high level features requires much less
communication of information than the “paint by numbers” approach of describing the
color on each square millimeter of the image.
If we think of features in this way, as an efficient way to communicate our data, then
neural networks can be used to automatically extract them. The neural network shown in
Figure 2.2 is used to extract features by requiring the network to learn to recreate the input
data at the output nodes by using just 5 hidden nodes. Consider that if you were allowed
100 hidden nodes, that recreating the data for the network would be rather trivial - simply
pass the input node value directly through the corresponding hidden node and on to the
output node. But as there are fewer and fewer hidden nodes, that information has to be
passed through the hidden layer in a more and more efficient manner, since there are fewer
hidden nodes to help pass along the information.
Figure 2.2 Neural networks can be used for data compression and feature extraction.
In order to accomplish this the neural network tries to have the hidden nodes extract
features from the input nodes that efficiently describe the record represented at the input
layer. This forced “squeezing” of the data through the narrow hidden layer forces the
neural network to extract only those predictors and combinations of predictors that are best
at recreating the input record. The hidden nodes are effectively creating features that are
combinations of the input node values.
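A minimal sketch of this kind of feature extraction, assuming NumPy and made-up normalized customer data: a linear network with 5 hidden nodes is trained by gradient descent to recreate its 20 inputs at its outputs, so the hidden activations become the extracted features (a real network such as the one in Figure 2.2 would also use nonlinear activations):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 20))            # 200 made-up records, 20 predictors scaled to 0-1

n_hidden = 5                          # the narrow hidden layer that forces feature extraction
d = X.shape[1]
W_enc = rng.normal(scale=0.1, size=(d, n_hidden))   # input -> hidden weights
W_dec = rng.normal(scale=0.1, size=(n_hidden, d))   # hidden -> output weights

lr = 0.05
for _ in range(2000):
    H = X @ W_enc                     # hidden-node activations (the extracted features)
    X_hat = H @ W_dec                 # attempt to recreate the inputs at the output nodes
    err = X_hat - X
    # Gradient descent on the mean squared reconstruction error.
    grad_dec = (H.T @ err) / len(X)
    grad_enc = (X.T @ (err @ W_dec.T)) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

features = X @ W_enc                  # 5 compressed features per record
print(features.shape)
print(float(np.mean((X @ W_enc @ W_dec - X) ** 2)))   # reconstruction error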
Chapter 5
APPLICATIONS OF DATA MINING
A wide range of companies have deployed successful applications of data mining. While
early adopters of this technology have tended to be in information-intensive industries such
as financial services and direct mail marketing, the technology is applicable to any
company looking to leverage a large data warehouse to better manage their customer
relationships. Two critical factors for success with data mining are: a large, well-integrated
data warehouse and a well-defined understanding of the business process within which
data mining is to be applied (such as customer prospecting, retention, campaign
management, and so on).
• A pharmaceutical company can analyze its recent sales force activity and their
results to improve targeting of high-value physicians and determine which
marketing activities will have the greatest impact in the next few months. The data
needs to include competitor market activity as well as information about the local
health care systems. The results can be distributed to the sales force via a wide-area
network that enables the representatives to review the recommendations from the
perspective of the key attributes in the decision process. The ongoing, dynamic
analysis of the data warehouse allows best practices from throughout the
organization to be applied in specific sales situations.
• A credit card company can leverage its vast warehouse of customer transaction data
to identify customers most likely to be interested in a new credit product. Using a
small test mailing, the attributes of customers with an affinity for the product can
be identified. Recent projects have indicated more than a 20-fold decrease in costs
for targeted mailing campaigns over conventional approaches.
• A diversified transportation company with a large direct sales force can apply data
mining to identify the best prospects for its services. Using data mining to analyze
its own customer experience, this company can build a unique segmentation
identifying the attributes of high-value prospects. Applying this segmentation to a
general business database such as those provided by Dun & Bradstreet can yield a
prioritized list of prospects by region.
• A large consumer package goods company can apply data mining to improve its
sales process to retailers. Data from consumer panels, shipments, and competitor
activity can be applied to understand the reasons for brand and store switching.
Through this analysis, the manufacturer can select promotional strategies that best
reach their target customer segments.
Each of these examples has a clear common ground: they leverage the knowledge about
customers implicit in a data warehouse to reduce costs and improve the value of customer
relationships. These organizations can now focus their efforts on the most important
(profitable) customers and prospects, and design targeted marketing strategies to best reach
them.
Games
Since the early 1960s, the availability of oracles for certain combinatorial games, also
called tablebases (e.g. for 3x3-chess with any beginning configuration, small-board dots-
and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), has
opened up a new area for data mining: the extraction of human-usable
strategies from these oracles. Current pattern recognition approaches do not seem to
have the required high level of abstraction in order to be applied successfully. Instead,
extensive experimentation with the tablebases, combined with an intensive study of
tablebase answers to well designed problems and with knowledge of prior art, i.e. pre-
tablebase knowledge, is used to yield insightful patterns. Berlekamp in dots-and-boxes
and John Nunn in chess endgames are notable examples of researchers doing this work,
though they were not and are not involved in tablebase generation.
Business
Data Mining is a highly effective tool in the catalog marketing industry. Catalogers have a
rich history of customer transactions on millions of customers dating back several years.
Data mining tools can identify patterns among customers and help identify the most likely
customers to respond to upcoming mailing campaigns.
Science and engineering
In recent years, data mining has been widely used in areas of science and engineering such
as bioinformatics, genetics, medicine, education and electrical power engineering.
In the study of human genetics, an important goal is to understand the mapping
relationship between the inter-individual variation in human DNA sequences and
variability in disease susceptibility. In lay terms, it is to find out how changes in an
individual's DNA sequence affect the risk of developing common diseases such as cancer.
This is very important to help improve the diagnosis, prevention and treatment of the
diseases. The data mining technique that is used to perform this task is known as
multifactor dimensionality reduction.[14]
In the area of electrical power engineering, data mining techniques have been widely used
for condition monitoring of high voltage electrical equipment. The purpose of condition
monitoring is to obtain valuable information on the insulation's health status of the
equipment. Data clustering techniques such as the self-organizing map (SOM) have been applied to the
vibration monitoring and analysis of transformer on-load tap changers (OLTCs). Using
vibration monitoring, it can be observed that each tap change operation generates a signal
that contains information about the condition of the tap changer contacts and the drive
mechanisms. Obviously, different tap positions will generate different signals. However,
there was considerable variability amongst normal condition signals for the exact same tap
position. SOM has been applied to detect abnormal conditions and to estimate the nature of
the abnormalities.[15]
Data mining techniques have also been applied to dissolved gas analysis (DGA) on power
transformers. DGA, as a diagnostic for power transformers, has been available for many
years. Data mining techniques such as SOM have been applied to analyse the data and to
determine trends which are not obvious to standard DGA ratio techniques such as the
Duval Triangle.[15]
Future Scope and Work
Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of
store scanner data — and mining a mountain for a vein of valuable ore. Both processes
require either sifting through an immense amount of material, or intelligently probing it to
find exactly where the value resides. Given databases of sufficient size and quality, data
mining technology can generate new business opportunities by providing these
capabilities:
References
12. Ellen Monk, Bret Wagner (2006). Concepts in Enterprise Resource Planning,
Second Edition. Thomson Course Technology, Boston, MA. ISBN 0-619-21663-8.
OCLC 224465825.
13. Tony Fountain, Thomas Dietterich & Bill Sudyka (2000) Mining IC Test Data to
Optimize VLSI Testing, in Proceedings of the Sixth ACM SIGKDD International
Conference on Knowledge Discovery & Data Mining. (pp. 18-25). ACM Press.
14. Xingquan Zhu, Ian Davidson (2007). Knowledge Discovery and Data Mining:
Challenges and Realities. Hershey, New York. p. 18. ISBN 978-159904252-7.
15. A.J. McGrail, E. Gulski et al. "Data Mining Techniques to Assess the Condition of
High Voltage Electrical Plant". CIGRE WG 15.11 of Study Committee 15.
16. R. Baker. "Is Gaming the System State-or-Trait? Educational Data Mining Through
the Multi-Contextual Application of a Validated Behavioral Model". Workshop on
Data Mining for User Modeling 2007.
17. J.F. Superby, J-P. Vandamme, N. Meskens. "Determination of factors influencing
the achievement of the first-year university students using data mining methods".
Workshop on Educational Data Mining 2006.
18. Xingquan Zhu, Ian Davidson (2007). Knowledge Discovery and Data Mining:
Challenges and Realities. Hershey, New York. pp. 163–189. ISBN 978-159904252-7.
19. ibid. pp. 31–48.
20. Yudong Chen, Yi Zhang, Jianming Hu, Xiang Li. "Traffic Data Analysis Using
Kernel PCA and Self-Organizing Map". Intelligent Vehicles Symposium, 2006
IEEE.
21. Bate A, Lindquist M, Edwards IR, Olsson S, Orre R, Lansner A, De Freitas RM. A
Bayesian neural network method for adverse drug reaction signal generation. Eur J
Clin Pharmacol. 1998 Jun;54(4):315-21.
22. Norén GN, Bate A, Hopstadius J, Star K, Edwards IR. Temporal Pattern Discovery
for Trends and Transient Effects: Its Application to Patient Records. Proceedings
of the Fourteenth International Conference on Knowledge Discovery and Data
Mining SIGKDD 2008, pages 963-971. Las Vegas NV, 2008.
23. Healey, R., 1991, Database Management Systems. In Maguire, D., Goodchild,
M.F., and Rhind, D., (eds.), Geographic Information Systems: Principles and
Applications (London: Longman).
24. Câmara, A. S. and Raper, J., (eds.), 1999, Spatial Multimedia and Virtual Reality,
(London: Taylor and Francis).
25. Miller, H. and Han, J., (eds.), 2001, Geographic Data Mining and Knowledge
Discovery, (London: Taylor & Francis).
26. Government Accountability Office, Data Mining: Early Attention to Privacy in
Developing a Key DHS Program Could Reduce Risks, GAO-07-293, Washington,
D.C.: February 2007.
27. Secure Flight Program report, MSNBC.
28. "Total/Terrorism Information Awareness (TIA): Is It Truly Dead?". Electronic
Frontier Foundation (official website). 2003.
http://w2.eff.org/Privacy/TIA/20031003_comments.php. Retrieved 2009-03-15.