Introduction To Data Mining With Case Studies Author: G. K. Gupta Prentice Hall India, 2006
Objectives
What is data mining?
Why data mining?
What applications?
What techniques?
What process?
What software?
27 Novembe
GKGupta
Definition
Data mining may be defined as follows:
Data mining is a collection of techniques for
efficient automated discovery of previously
unknown, valid, novel, useful and understandable
patterns in large databases. The patterns must be
actionable so that they may be used in an enterprise's
decision making.
Examples
Amazon.com uses associations. Recommendations
to customers are based on past purchases and what
other customers are purchasing.
Just for Feet, a store chain in the USA, has about 200 stores,
each carrying up to 6000 shoe styles, each style in
several sizes. Data mining is used to find the right
shoes to stock in the right store.
More examples appear in the case studies to be discussed
later.
Data Mining
We assume we are dealing with large data, perhaps
gigabytes, perhaps terabytes.
Although data mining is possible with smaller
amounts of data, the bigger the data, the higher the
confidence in any unknown pattern that is
discovered.
There is considerable hype about data mining at the
present time and Gartner Group has listed data
mining as one of the top ten technologies to watch.
Question: How many books could one store in one Terabyte of memory?
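A rough way to approach the question above is a back-of-the-envelope division. The sketch below assumes a decimal terabyte and about 1 MB of plain text per book; the per-book size is an assumption made for illustration, not a figure from the slides.

```python
# Back-of-the-envelope estimate of books per terabyte.
# ASSUMPTION: one book is about 1 MB of plain text (not from the slides).
BYTES_PER_TERABYTE = 10**12   # decimal terabyte
BYTES_PER_BOOK = 10**6        # assumed size of one plain-text book

books_per_terabyte = BYTES_PER_TERABYTE // BYTES_PER_BOOK
print(f"Roughly {books_per_terabyte:,} books fit in one terabyte")
```

Changing the assumed book size (for example to account for images) changes the answer proportionally.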
Information explosion
Many adults in India generate:
Mobile phone transactions. More than 300 million
phones in India, reportedly growing at the rate of
10,000 new ones every hour! Mobile companies
must save information about calls.
Growing middle class with growing number of
credit and debit card transactions. About 25m
credit cards and 70m debit cards in 2007. Annual
growth rate about 30% and 40% respectively.
Could be 55m credit cards and 200m debit cards in
2010 resulting in perhaps 500m transactions
annually.
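The 2010 projections above can be checked by compounding the 2007 card counts at the stated annual growth rates for three years. A minimal sketch of that arithmetic follows; the function name `project` is illustrative only.

```python
def project(count_millions, annual_growth, years):
    """Compound a starting count at a fixed annual growth rate."""
    return count_millions * (1 + annual_growth) ** years

# Figures taken from the slide: 25m credit and 70m debit cards in 2007,
# growing at about 30% and 40% a year respectively.
credit_2010 = project(25, 0.30, 3)
debit_2010 = project(70, 0.40, 3)

print(f"Credit cards in 2010: about {credit_2010:.0f}m")
print(f"Debit cards in 2010: about {debit_2010:.0f}m")
```

The results are close to the rounded 55m and 200m quoted on the slide.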
Information explosion
India has some huge enterprises, for example
Indian Railways, perhaps the busiest network in the
world with 2.5m employees, 10,000 locomotives,
10,000 passenger trains daily, 10,000 freight
trains daily and 20m passengers daily.
Growing airline traffic with more than ten airlines.
Perhaps 30m passengers annually.
Growing number of motor vehicle registrations,
insurance policies and driver licences.
Internet surfing records.
OLTP
As noted earlier, most enterprise database systems
were designed in the 1970s or 1980s and were
mainly designed to automate some of the office
procedures e.g. order entry, student enrolment,
patient registration, airline reservations. These are
well-structured, repetitive operations that are easily
automated.
Decision Making
Need for business memory and intelligence.
Need to serve customers better by learning from
past interactions.
OLTP data is not a good basis for maintaining an
enterprise memory.
The intelligence hidden in data could be the secret
weapon in a competitive business world, but given
the information explosion not even a small fraction
could be looked at by the human eye.
Question: Why is OLTP not good for maintaining an enterprise memory?
Operational vs Management View
Operational               Decision-Support
Users                     Management
Day-to-day work           Decision support
Application oriented      Subject oriented
Current data              Historical data
Detailed
Simple queries            Complex queries
Predetermined queries     Ad hoc queries
Update/Select             Only Select
Real-time                 Not real-time
Evolution of Technology
Corporate data growth accompanied by decline in
the cost of storage and processing.
PC motherboard performance, measured in MHz/$,
is currently doubling every 27 ± 2 months.
The next slide, which uses a logarithmic scale, shows that disk
storage is now about 10GB per US dollar, and the following
slide shows that sales of disk storage are growing
exponentially.
Look at computing trends at
http://www.zoology.ubc.ca/~rikblok/ComputingTrends/
Question: How much does a 100GB disk cost? What is the cost of a PC and
what is its CPU performance?
Evolution of Technology
Question: What do the graphs in the last two slides tell us? What scales are
used in them? What was the pink line in the first graph?
Evolution of Technology
Database technology has improved over the
years.
Data collection is often much better and cheaper
now.
The need for analyzing and synthesizing
information is growing in today's fiercely competitive
business environment.
New applications
Sophisticated applications of modern enterprises
include:
- sales forecasting and analysis
- marketing and promotion planning
- business modeling
OLTP is not designed for such applications. Also,
large enterprises operate a number of database
systems, so it is necessary to integrate
information for decision-making applications.
Question: Why can OLTP not be used for sales forecasting and analysis?
Growth of cards
A recent survey in the USA found that the
percentages of US adults using the following
types of cards were:
Credit cards - 88%
ATM cards - 60%
Membership cards - 58%
Debit cards - 35%
Prepaid cards - 35%
Loyalty cards - 29%
Question: What kind of data do these cards generate?
Algorithms
A variety of statistical and learning algorithms
from fields like statistics and
artificial intelligence have been adapted for
data mining.
With the new focus on data mining, new algorithms
are being developed.
Availability of Software
A large variety of DM software is now available.
Some of the more widely used packages are:
IBM - Intelligent Miner and more
SAS - Enterprise Miner
Silicon Graphics - MineSet
Oracle - Thinking Machines - Darwin
Angoss - knowledgeSEEKER
Strong Business Competition
Growth in service economies. Almost every
business is a service business. Service
economies are information rich and very
competitive.
Consider the telecommunications environment in
Australia. About 20 years ago, Telstra was a
monopoly. The field is now very competitive.
The mobile phone market in India is also very
competitive.
Applications
In finance, telecom, insurance and retail:
Loan/credit card approval
market segmentation
fraud detection
better marketing
trend analysis
market basket analysis
customer churn
Web site design and promotion
Market Segmentation
Large amounts of data about customers contain
valuable information.
The market may be segmented into many
subgroups according to variables that are good
discriminators.
It is not always easy to find variables that will help in
market segmentation.
Fraud Detection
Very challenging since it is difficult to define
characteristics of fraud. Often based on
detecting changes from the norm.
In statistics, it is common to throw out the
outliers but in data mining it may be useful to
identify them since they could either be due to
errors or perhaps fraud.
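As an illustration of "detecting changes from the norm", the sketch below flags values that lie far from the mean in units of standard deviation (a simple z-score test). The transaction amounts and the threshold of 2.0 are made up for the example, and real fraud detection would use far richer features than a single amount.

```python
from statistics import mean, pstdev

def flag_outliers(values, threshold=2.0):
    """Return the values whose z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical card transaction amounts for one customer.
amounts = [20, 22, 19, 21, 23, 20, 500]
suspicious = flag_outliers(amounts)
print(suspicious)
```

A flagged value is only a candidate: as the slide notes, it may be a data error rather than fraud, so it still needs to be investigated.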
Better Marketing
When customers buy new products, other
products may be suggested to them when they
are ready.
As noted earlier, in mail order marketing for
example, one wants to know:
- will the customer respond?
- will the customer buy and how much?
- will the customer return the purchase?
- will the customer pay for the purchase?
Better Marketing
It has been reported that more than 1000
variable values on each customer are held by
some mail order marketing companies.
The aim is to lift the response rate.
Trend analysis
In a large company, not all trends are always
visible to the management. It is then useful to use
data mining software that will identify trends.
Trends may be long term trends, cyclic trends or
seasonal trends.
Customer Churn
In businesses like telecommunications,
companies are trying very hard to keep their
good customers and to perhaps persuade good
customers of their competitors to switch to them.
In such an environment, businesses want to find
which customers are good, why customers switch
and what makes customers loyal.
It is cheaper to develop a retention plan and retain
an old customer than to bring in a new customer.
Customer Churn
The aim is to get to know the customers better
so you will be able to keep them longer.
Given the competitive nature of businesses,
customers will move if not looked after.
Also, some businesses may wish to get rid of
customers that cost more than they are worth,
e.g. credit card holders that don't use the card or
bank customers with very small amounts of
money in their accounts.
Requirements Analysis
The enterprise decision makers need to formulate
goals that the data mining process is expected to
achieve. The business problem must be clearly
defined. One cannot use data mining without a
good idea of what kind of outcomes the
enterprise is looking for.
If objectives have been clearly defined, it is easier
to evaluate the results of the project.
Results Visualisation
Explaining the results of data mining to the
decision makers is an important step. Most DM
software includes data visualisation modules
which should be used in communicating data
mining results to the managers.
Clever data visualisation tools are being
developed to display results that deal with more
than two dimensions. The visualisation tools
available should be tried and used if found
effective for the given problem.
CRISP-DM Steps
The six CRISP-DM steps are:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modelling
5. Evaluation
6. Deployment
CRISP-DM Steps
The six steps proposed in CRISP-DM are similar to
the six steps proposed earlier.
The CRISP-DM steps are shown in the following
figure.
Question: Compare the two sets of steps, one given in previous few slides and
the CRISP-DM approach. Which approach is better?
Association analysis
Classification and prediction
Cluster analysis
Web data mining
Search Engines
Data warehouse and OLAP
Others, for example, Sequential patterns and
Time-series analysis, not covered in this book
Association Analysis
Association analysis involves discovery of
relationships or correlations among a set of
items.
An example is discovering that personal loans are repaid with
80% confidence when the person owns his
home.
The classical example is the one where a store
discovered that people buying nappies also tend
to buy beer.
Associations
Association rules are often written as X → Y,
meaning that whenever X appears Y also tends to
appear. X and Y may be collections of attributes.
A supermarket like Woolworths may have several
thousand items and many millions of transactions
a week (i.e. gigabytes of data each week). Note
that the quantities of items bought are ignored.
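The strength of a rule "X implies Y" is usually summarised by its support (the fraction of transactions containing both X and Y) and its confidence (the fraction of transactions containing X that also contain Y). The sketch below computes both on a tiny, made-up set of baskets; the data and function names are illustrative, and item quantities are ignored, as noted above.

```python
# Hypothetical shopping baskets; each transaction is a set of items.
transactions = [
    {"nappies", "beer", "milk"},
    {"nappies", "beer"},
    {"nappies", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Estimated probability of seeing Y in a transaction that contains X."""
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0  # X never occurs, so the rule cannot be evaluated
    return support(set(x) | set(y), transactions) / support_x

rule_support = support({"nappies", "beer"}, transactions)
rule_confidence = confidence({"nappies"}, {"beer"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

With real supermarket data the number of candidate item sets is enormous, which is why dedicated association mining algorithms are needed rather than this brute-force counting.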
Cluster Analysis
Similar to classification in that the aim is to build
clusters such that each of them is similar within
itself but is dissimilar to others. Clustering does not
rely on class-labeled data objects.
Based on the principle of maximizing the
intracluster similarity and minimizing the
intercluster similarity.
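One common way to build such clusters is k-means, which the slides do not name; it is used here only as an illustration of grouping unlabeled data so that points within a cluster are close together. The one-dimensional data and the starting centroids are made up for the example.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Very small k-means for one-dimensional data."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(points, centroids=[1, 12])
print(centroids, clusters)
```

No class labels are supplied anywhere: the two groups emerge purely from the distances between the points.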
Search engines
Normally the search engine databases of Web
pages are built and updated automatically by
Web crawlers. When one searches the Web using
one of the search engines, one is not searching
the entire Web. Instead one is only searching the
database that has been compiled by the search
engine. There are a number of challenging
problems related to search engines that are
discussed in Chapter 6 including how to assign a
ranking to each Web page that is retrieved in
response to a user query.
Task-relevant Data
The whole database may not be required since it
may be that we only want to study something
specific, e.g. trends in postgraduate students:
- countries they come from
- degree program they are doing
- their age
- time they take to finish the degree
- scholarships they have been awarded
May need to build a database subset before data
mining can be done.
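Building such a subset amounts to selecting the relevant rows and keeping only the relevant attributes. A minimal sketch follows, assuming the student records are available as dictionaries; every field name and value here is hypothetical.

```python
# Hypothetical student records; field names are assumptions for illustration.
students = [
    {"name": "A", "level": "postgraduate", "country": "India",
     "program": "PhD", "age": 27, "years_to_finish": 4.0, "scholarship": True},
    {"name": "B", "level": "undergraduate", "country": "Australia",
     "program": "BSc", "age": 19, "years_to_finish": 3.0, "scholarship": False},
    {"name": "C", "level": "postgraduate", "country": "China",
     "program": "MSc", "age": 24, "years_to_finish": 1.5, "scholarship": False},
]

# Attributes relevant to the postgraduate trend study listed above.
RELEVANT = ("country", "program", "age", "years_to_finish", "scholarship")

# Keep only postgraduate rows, and only the task-relevant columns.
subset = [{key: s[key] for key in RELEVANT}
          for s in students if s["level"] == "postgraduate"]
print(subset)
```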
Task-relevant Data
Data collection is non-trivial.
OLTP data is not useful since it is changing all the
time. In some cases, data from more than one
database may be needed.
Preprocessing
A data mining process would normally involve
preprocessing
Often data mining applications use data
warehousing
One approach is to pre-mine the data, warehouse
it, then carry out data mining
The process is usually iterative and can take
years of effort for a large project.
Data Preprocessing
Preprocessing is very important although often
considered too mundane to be taken seriously
Preprocessing may also be needed after the data
warehouse phase
Data reduction may be needed to transform very
high-dimensional data into lower-dimensional
data.
Data Preprocessing
Feature Selection
Use sampling?
Normalization
Smoothing
Dealing with duplicates, missing data
Dealing with time-dependent data
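Three of the steps listed above (removing duplicates, filling in missing data, and normalization) can be sketched as follows. The records are made up, and filling a missing value with the mean and using min-max scaling are just one choice among several possible strategies.

```python
def deduplicate(records):
    """Drop exact duplicate records, keeping the first occurrence."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique

def fill_missing(values):
    """Replace None with the mean of the known values (one simple strategy)."""
    known = [v for v in values if v is not None]
    fill = sum(known) / len(known)
    return [fill if v is None else v for v in values]

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical (customer id, income in thousands) records.
records = [("c1", 20.0), ("c2", None), ("c1", 20.0), ("c3", 40.0)]
unique = deduplicate(records)
incomes = fill_missing([income for _, income in unique])
scaled = min_max(incomes)
print(unique, incomes, scaled)
```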
Background knowledge
Background information may be useful in the
discovery process.
For example, concept hierarchies or relationships
between data may be useful in data mining. For
postgraduate degrees, we may wish to look at all
Masters degrees and all doctorate degrees
separately.
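A concept hierarchy of this kind can be represented as a simple mapping from specific values to a more general level. In the sketch below the degree names and their grouping are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: specific degree -> general degree level.
DEGREE_LEVEL = {
    "MSc": "Masters",
    "MBA": "Masters",
    "MEng": "Masters",
    "PhD": "Doctorate",
    "DSc": "Doctorate",
}

def group_by_level(degrees):
    """Group degrees under the higher-level concept they belong to."""
    groups = defaultdict(list)
    for degree in degrees:
        groups[DEGREE_LEVEL.get(degree, "Other")].append(degree)
    return dict(groups)

grouped = group_by_level(["MSc", "PhD", "MBA"])
print(grouped)
```

Mining can then be run on the general level (Masters versus Doctorate) rather than on each individual degree.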
Measuring interest
A data mining process may generate many patterns.
We cannot look at all of them and so need some
way to separate uninteresting results from the
interesting ones.
This may be based on simplicity of pattern, rule
length, or level of confidence.
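A filter based on two of these criteria, rule length and confidence, might look like the sketch below. The rules, their confidence values and the thresholds are all made up for illustration.

```python
# Hypothetical discovered rules: antecedent items, consequent items, confidence.
rules = [
    {"if": ("nappies",), "then": ("beer",), "confidence": 0.67},
    {"if": ("milk",), "then": ("bread",), "confidence": 0.30},
    {"if": ("milk", "bread", "eggs"), "then": ("butter",), "confidence": 0.90},
]

def interesting(rules, min_confidence=0.6, max_length=3):
    """Keep rules that are confident enough and short enough to read easily."""
    return [
        rule for rule in rules
        if rule["confidence"] >= min_confidence
        and len(rule["if"]) + len(rule["then"]) <= max_length
    ]

kept = interesting(rules)
print(kept)
```

Here the second rule is dropped for low confidence and the third for being too long, even though its confidence is high.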
Visualization
We must be able to display results so that they are
easy to understand.
The display may be a graph, pie chart, table, etc.
Some displays are better than others for a given
kind of knowledge.
Performance
Usability