Introduction To Data Mining With Case Studies Author: G. K. Gupta Prentice Hall India, 2006

Introduction

Introduction to Data Mining with Case Studies


Author: G. K. Gupta
Prentice Hall India, 2006.

Objectives
What is data mining?
Why data mining?
What applications?
What techniques?
What process?
What software?

27 Novembe

GKGupta

Definition
Data mining may be defined as follows:
Data mining is a collection of techniques for
efficient automated discovery of previously
unknown, valid, novel, useful and understandable
patterns in large databases. The patterns must be
actionable so that they may be used in an enterprise's
decision making.


What is Data Mining?


Efficient automated discovery of previously
unknown patterns in large volumes of data.
Patterns must be valid, novel, useful and
understandable.
Businesses are mostly interested in discovering
past patterns to predict future behaviour.
A data warehouse, to be discussed later, can be an
enterprise's memory. Data mining can provide
intelligence using that memory.


Examples
amazon.com uses associations. Recommendations
to customers are based on past purchases and what
other customers are purchasing.
Just for Feet, a retailer in the USA, has about 200 stores,
each carrying up to 6000 shoe styles, each style in
several sizes. Data mining is used to find the right
shoes to stock in the right store.
More examples in case studies to be discussed
later.


Data Mining
We assume we are dealing with large data, perhaps
gigabytes, perhaps terabytes.
Although data mining is possible with smaller
amounts of data, the bigger the data, the higher the
confidence in any unknown pattern that is
discovered.
There is considerable hype about data mining at the
present time and Gartner Group has listed data
mining as one of the top ten technologies to watch.

Question: How many books could one store in one Terabyte of memory?
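
One rough way to approach the question above is a back-of-the-envelope estimate. Every figure in this sketch is an assumption (book length, one byte per character, a decimal terabyte), so the result only indicates the order of magnitude.

```python
# Back-of-the-envelope estimate; all figures below are assumptions.
TERABYTE_BYTES = 10**12      # decimal terabyte
PAGES_PER_BOOK = 500         # assumed length of a typical book
CHARS_PER_PAGE = 2_000       # assumed plain text, one byte per character

book_bytes = PAGES_PER_BOOK * CHARS_PER_PAGE
books_per_terabyte = TERABYTE_BYTES // book_bytes
print(books_per_terabyte)
```

Under these assumptions a book takes about one megabyte, so a terabyte holds on the order of a million plain-text books. Books with images would take far more space.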


Why Data Mining Now?


Growth in generation and storage of corporate
data: the information explosion.
Need for sophisticated decision making: current
database systems are Online Transaction
Processing (OLTP) systems. The OLTP data is
difficult to use for such applications. Why?
Evolution of technology: much cheaper storage,
easier data collection, better database
management, and a move towards data analysis and
understanding.


Information explosion

Database systems have been in use since the
1960s in Western countries (perhaps since the
1980s in India). These systems have generated
mountains of data.
Point of sale terminals and bar codes on many
products, railway bookings, educational
institutions, huge number of mobile phones,
electronic commerce, all generate data.
Government is now collecting a lot of
information.


Information explosion

Internet banking via networked computers and
ATMs.
Credit and debit cards.
Medical data, doctors, hospitals.
Transportation, Indian railways, automatic toll
collection on toll roads, growing air travel.
Passports, NRI visas, other visas, NRI money
transfers.

Question: Can you think of other examples of data collection?


Information explosion
Many adults in India generate:
Mobile phone transactions. More than 300 million
phones in India, reportedly growing at the rate of
10,000 new ones every hour! Mobile companies
must save information about calls.
Growing middle class with growing number of
credit and debit card transactions. About 25m
credit cards and 70m debit cards in 2007. Annual
growth rate about 30% and 40% respectively.
Could be 55m credit cards and 200m debit cards in
2010 resulting in perhaps 500m transactions
annually.
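
The 2010 card projections above can be checked with a simple compound-growth calculation. This is a minimal sketch that assumes the quoted annual growth rates hold unchanged for the three years from 2007 to 2010; the function name `project` is introduced here for illustration only.

```python
def project(count_millions, annual_growth, years):
    """Compound growth: count * (1 + rate) ** years."""
    return count_millions * (1 + annual_growth) ** years

# Figures from the slide: 25m credit cards growing at 30% a year,
# 70m debit cards growing at 40% a year, projected from 2007 to 2010.
credit_2010 = project(25, 0.30, 3)
debit_2010 = project(70, 0.40, 3)

print(round(credit_2010, 1), round(debit_2010, 1))
```

The results, roughly 55m credit cards and 192m debit cards, are consistent with the rounded figures of 55m and 200m quoted above.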

Information explosion
India has some huge enterprises, for example
Indian railways, perhaps the busiest network in the
world with 2.5m employees, 10,000 locomotives,
10,000 passenger trains daily, 10,000 freight
trains daily and 20m passengers daily.
Growing airline traffic with more than ten airlines.
Perhaps 30m passengers annually.
Growing number of motor vehicles: registration,
insurance, driver licences
Internet surfing records


OLTP
As noted earlier, most enterprise database systems
were designed in the 1970s or 1980s and were
mainly designed to automate some of the office
procedures e.g. order entry, student enrolment,
patient registration, airline reservations. These are
well structured repetitive operations easily
automated.


Decision Making
Need for business memory and intelligence.
Need to serve customers better by learning from
past interactions.
OLTP data is not a good basis for maintaining an
enterprise memory.
The intelligence hidden in data could be the secret
weapon in a competitive business world, but given
the information explosion not even a small fraction
could be looked at by the human eye.
Question: Why is OLTP not good for maintaining an enterprise memory?


OLTP vs Decision Making


Clerical view of data: focuses on details required
for day-to-day running of an enterprise.
Management view of data: focuses on summary
data to identify trends, challenges and
opportunities.
The detailed data view is the operational view
while the management view is the decision-support
view. Comparison of the two views:


Operational vs Management View

Operational              Decision-Support
Users: admin staff       Users: management
Day-to-day work          Decision support
Application oriented     Subject oriented
Current data             Historical data
Detailed                 Overall view, summaries
Simple queries           Complex queries
Predetermined queries    Ad hoc queries
Update/Select            Select only
Real-time                Not real-time
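
The contrast between the two views can be illustrated with two queries against the same data. This is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns and its rows are hypothetical and only stand in for an operational database.

```python
import sqlite3

# Hypothetical sales table held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (order_id INTEGER PRIMARY KEY, region TEXT, "
    "year INTEGER, amount REAL)"
)
rows = [
    (1, "North", 2006, 120.0),
    (2, "North", 2007, 150.0),
    (3, "South", 2006, 80.0),
    (4, "South", 2007, 95.0),
    (5, "North", 2007, 60.0),
]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

# Operational view: a simple, predetermined lookup of one detailed record.
order = conn.execute(
    "SELECT region, amount FROM sales WHERE order_id = ?", (4,)
).fetchone()

# Decision-support view: a summary over historical data.
summary = conn.execute(
    "SELECT region, year, SUM(amount) FROM sales "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()

print(order)
print(summary)
```

The first query touches a single row, which is what an OLTP system is tuned for. The second scans and aggregates the whole table, which is typical of decision-support work.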


Evolution of Technology
Corporate data growth accompanied by decline in
the cost of storage and processing.
PC motherboard performance, measured in MHz/$,
is currently doubling every 27 ± 2 months.
Next slide, using a logarithmic scale, shows that disk
is now about 10GB per US dollar and the following
slide shows that sales of disk storage are growing
exponentially.
Look at computing trends at
http://www.zoology.ubc.ca/~rikblok/ComputingTrends/
Question: What is the cost of a 100GB disk? What is the cost of a PC and
what is its CPU performance?


Decline in Hard Drive cost


Growth in Worldwide Disk Capacity


Evolution of Technology

Question: What do the graphs in the last two slides tell us? What scales are
used in them? What was the pink line in the first graph?


Evolution of Technology
Database technology has improved over the
years.
Data collection is often much better and cheaper
now.
The need for analyzing and synthesizing
information is growing in today's fiercely competitive
business environment.


New applications
Sophisticated applications of modern enterprises
include:
- sales forecasting and analysis
- marketing and promotion planning
- business modeling
OLTP is not designed for such applications. Also,
large enterprises operate a number of database
systems and then it is necessary to integrate
information for decision making applications.
Question: Why can OLTP not be used for sales forecasting and analysis?


Why Data Mining Now?


As noted earlier, the reasons may be summarized
as:
Accumulation of large amounts of data
Increased affordable computing power enabling
data mining processing
Statistical and learning algorithms
Availability of software
Strong business competition


Large amount of data


As already discussed, many enterprises have
large amounts of data accumulated over 30+
years.
As noted earlier, some enterprises collect
information for analysis. For example,
supermarkets in the USA offer loyalty cards in
exchange for shopper information. Loyalty cards in
Australia also collect information using a reward
system.


Growth of cards
A recent survey in the USA found that the
percentages of US adults using the following
types of cards were:
Credit cards - 88%
ATM cards - 60%
Membership cards - 58%
Debit cards - 35%
Prepaid cards - 35%
Loyalty cards - 29%
Question: What kind of data do these cards generate?


Affordable computing power

Data mining is usually computationally intensive.
Dramatic reduction in the price of computer
systems, as noted earlier, is making it possible to
carry out data mining without investing huge
amounts of resources in hardware and software.
In spite of affordable computing power, using
data mining can be resource intensive.


Algorithms
A variety of statistical and learning algorithms
have been available in fields like statistics and
artificial intelligence that have been adapted for
data mining.
With new focus on data mining, new algorithms
are being developed.


Availability of Software
Large variety of DM software is now available.
Some more widely used software is:
IBM - Intelligent Miner and more
SAS - Enterprise Miner
Silicon Graphics - MineSet
Oracle - Thinking Machines - Darwin
Angoss - knowledgeSEEKER


Strong Business Competition
Growth in service economies. Almost every
business is a service business. Service
economies are information rich and very
competitive.
Consider the telecommunications environment in
Australia. About 20 years ago, Telstra was a
monopoly. The field is now very competitive.
Mobile phone market in India is also very
competitive.


Applications
In finance, telecom, insurance and retail:
Loan/credit card approval
market segmentation
fraud detection
better marketing
trend analysis
market basket analysis
customer churn
Web site design and promotion

Loan/Credit card approvals


In a modern society, a bank does not know its
customers. The only knowledge a bank has is the
information stored in its computers.
Credit agencies and banks collect a lot of
customers' behavioural data from many sources.
This information is used to predict the chances of
a customer paying back a loan.


Market Segmentation
Large amounts of data about customers contain
valuable information
The market may be segmented into many
subgroups according to variables that are good
discriminators
Not always easy to find variables that will help in
market segmentation


Fraud Detection
Very challenging since it is difficult to define
characteristics of fraud. Often based on
detecting changes from the norm.
In statistics, it is common to throw out the
outliers but in data mining it may be useful to
identify them since they could either be due to
errors or perhaps fraud.
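
Detecting changes from the norm can be sketched with a simple z-score test. This is only one of many possible approaches and is chosen here for illustration. The transaction amounts and the threshold of 2.0 are assumptions, and `flag_outliers` is a hypothetical helper name.

```python
import statistics

def flag_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations
    from the mean (a basic z-score test)."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical transaction amounts for one customer.
amounts = [42, 38, 45, 40, 39, 41, 44, 400]
outliers = flag_outliers(amounts)
print(outliers)
```

The flagged value is a candidate for review rather than proof of fraud, since it could equally be a data entry error.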


Better Marketing
When customers buy new products, other
products may be suggested to them when they
are ready.
As noted earlier, in mail order marketing for
example, one wants to know:
- will the customer respond?
- will the customer buy and how much?
- will the customer return the purchase?
- will the customer pay for the purchase?


Better Marketing
It has been reported that more than 1000
variable values on each customer are held by
some mail order marketing companies.
The aim is to lift the response rate.


Trend analysis
In a large company, not all trends are always
visible to the management. It is then useful to use
data mining software that will identify trends.
Trends may be long term trends, cyclic trends or
seasonal trends.


Market Basket Analysis


Aims to find what the customers buy and what
they buy together
This may be useful in designing store layouts or
in deciding which items to put on sale
Basket analysis can also be used for
applications other than just analysing what
items customers buy together


Customer Churn
In businesses like telecommunications,
companies are trying very hard to keep their
good customers and to perhaps persuade good
customers of their competitors to switch to them.
In such an environment, businesses want to find
which customers are good, why customers switch
and what makes customers loyal.
Cheaper to develop a retention plan and retain
an old customer than to bring in a new customer.


Customer Churn
The aim is to get to know the customers better
so you will be able to keep them longer.
Given the competitive nature of businesses,
customers will move if not looked after.
Also, some businesses may wish to get rid of
customers that cost more than they are worth,
e.g. credit card holders that don't use the card,
bank customers with very small amounts of
money in their accounts.


Web site design


A Web site is effective only if the visitors easily
find what they are looking for.
Data mining can help discover affinity of visitors
to pages and the site layout may be modified
based on this information.


Data Mining Process


Successful data mining involves carefully
determining the aims and selecting appropriate
data.
The following steps should normally be followed:
1. Requirements analysis
2. Data selection and collection
3. Cleaning and preparing data
4. Data mining exploration and validation
5. Implementing, evaluating and monitoring
6. Results visualisation


Requirements Analysis
The enterprise decision makers need to formulate
goals that the data mining process is expected to
achieve. The business problem must be clearly
defined. One cannot use data mining without a
good idea of what kind of outcomes the
enterprise is looking for.
If objectives have been clearly defined, it is easier
to evaluate the results of the project.


Data Selection and Collection
Find the best source databases for the data that
is required. If the enterprise has implemented a
data warehouse, then most of the data could be
available there. Otherwise source OLTP systems
need to be identified and required information
extracted and stored in some temporary system.
In some cases, only a sample of the data
available may be required.


Cleaning and Preparing Data
This may not be an onerous task if a data warehouse
containing the required data exists, since most of this
must have already been done when data was loaded
in the warehouse.
Otherwise this task can be very resource intensive;
perhaps more than 50% of the effort in a data mining
project is spent on this step. Essentially a data store
that integrates data from a number of databases may
need to be created. When integrating data, one often
encounters problems like identifying data, dealing
with missing data, data conflicts and ambiguity. An
ETL (extraction, transformation and loading) tool may
be used to overcome these problems.


Exploration and Validation


Assuming that the user has access to one or
more data mining tools, a data mining model
may be constructed based on the enterprise's
needs. It may be possible to take a sample of
data and apply a number of relevant techniques.
For each technique the results should be
evaluated and their significance interpreted.
This is likely to be an iterative process which
should lead to selection of one or more
techniques that are suitable for further
exploration, testing and validation.

Implementing, Evaluating and Monitoring
Once a model has been selected and validated,
the model can be implemented for use by the
decision makers. This may involve software
development for generating reports or for results
visualisation and explanation for managers.
If more than one technique is available for the
given data mining task, it is necessary to
evaluate the results and choose the best. This
may involve checking the accuracy and
effectiveness of each technique.


Implementing, Evaluating and Monitoring
Regular monitoring of the performance of the
techniques that have been implemented is
required. Every enterprise evolves with time and
so must the data mining system. Monitoring may
from time to time lead to the refinement of
tools and techniques that have been
implemented.


Results Visualisation
Explaining the results of data mining to the
decision makers is an important step. Most DM
software includes data visualisation modules
which should be used in communicating data
mining results to the managers.
Clever data visualisation tools are being
developed to display results that deal with more
than two dimensions. The visualisation tools
available should be tried and used if found
effective for the given problem.


Data Mining Process: Another Approach
The last few slides presented one approach.
Another approach that also includes six steps is
CRISP-DM (Cross-Industry Standard Process for
Data Mining), developed by an industry consortium.
The six steps are:


CRISP-DM Steps
The six CRISP-DM steps are:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modelling
5. Evaluation
6. Deployment


CRISP-DM Steps
The six steps proposed in CRISP-DM are similar to
the six steps proposed earlier.
The CRISP-DM steps are shown in the following
figure.

Question: Compare the two sets of steps, one given in previous few slides and
the CRISP-DM approach. Which approach is better?


CRISP Data Mining Model


Data Mining Techniques


Although data mining is a new field, it uses many
techniques developed years ago in other fields:
machine learning, statistics, artificial intelligence,
etc.
These techniques are in some cases modified to
deal with large amounts of data.


Data Mining Techniques


Data mining includes a large number of techniques
including concept/class description, association
analysis, classification and prediction, cluster
analysis, outlier analysis etc.
Expression and visualization of data mining results is
a challenging task.
Privacy issues also need to be considered.


Data Mining Tasks

Association analysis
Classification and prediction
Cluster analysis
Web data mining
Search Engines
Data warehouse and OLAP
Others, for example, Sequential patterns and
Time-series analysis, not covered in this book


Association Analysis
Association analysis involves discovery of
relationships or correlations among a set of
items.
For example, discovering that personal loans are
repaid with 80% confidence when the person owns
their home.
The classical example is the one where a store
discovered that people buying nappies tend also
to buy beer.


Associations
The association rules are often written as X → Y,
meaning that whenever X appears Y also tends to
appear. X and Y may be collections of attributes.
A supermarket like Woolworths may have several
thousand items and many millions of transactions
a week (i.e. gigabytes of data each week). Note
that the quantities of items bought are ignored.
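
How strongly X tends to bring Y with it is commonly measured by support and confidence. The sketch below uses the standard definitions: support is the fraction of transactions containing an itemset, and confidence of a rule from X to Y is support(X and Y together) divided by support(X). The five baskets are hypothetical, and each basket is a set, so quantities are ignored as noted above.

```python
# Hypothetical market baskets; sets ignore how many of each item were bought.
transactions = [
    {"nappies", "beer", "milk"},
    {"nappies", "beer"},
    {"nappies", "bread"},
    {"beer", "crisps"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

rule_support = support({"nappies", "beer"}, transactions)
rule_confidence = confidence({"nappies"}, {"beer"}, transactions)
print(rule_support, round(rule_confidence, 2))
```

Here nappies and beer appear together in two of five baskets (support 0.4), and two of the three nappies baskets also contain beer (confidence about 0.67).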


Classification and Prediction


A set of training objects, each with a number of
attribute values, is given to the classifier. The
classifier formulates rules for each class in the
training set so that the rules may be used to
classify new objects. Some techniques do not
require training data.
Classification may be used for predicting the class
label of data objects. A number of techniques are
available, including decision trees and neural networks.
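
The idea of learning a rule from training objects can be sketched with a decision stump, that is, a decision tree with a single test on one numeric attribute. This is a deliberately tiny illustration, not a full decision tree algorithm. The loan data is hypothetical, and the sketch assumes exactly two class labels.

```python
def train_stump(examples):
    """examples: list of (value, label) pairs with exactly two class labels.
    Returns (threshold, label_at_or_below, label_above) with the fewest
    errors on the training data."""
    values = sorted({value for value, _ in examples})
    labels = sorted({label for _, label in examples})
    best = None
    # Candidate thresholds lie midway between consecutive distinct values.
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        for below, above in ((labels[0], labels[1]), (labels[1], labels[0])):
            errors = sum(
                1 for value, label in examples
                if (below if value <= threshold else above) != label
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    return best[1:]

def classify(stump, value):
    threshold, below, above = stump
    return below if value <= threshold else above

# Hypothetical training set: (annual income in thousands, loan outcome).
training = [(15, "default"), (22, "default"), (28, "default"),
            (35, "repaid"), (48, "repaid"), (60, "repaid")]
stump = train_stump(training)
print(stump, classify(stump, 40))
```

The learned rule reads "income at or below 31.5 means default, otherwise repaid", and it can then be applied to new objects such as an applicant with an income of 40.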


Cluster Analysis
Similar to classification in that the aim is to build
clusters such that each of them is similar within
itself but is dissimilar to others. Clustering does not
rely on class-labeled data objects.
Based on the principle of maximizing the
intra-cluster similarity and minimizing the
inter-cluster similarity.
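
One common clustering method is k-means, used here purely as an illustration since the slide does not name a specific algorithm. The sketch assumes two-dimensional points, Euclidean distance as the measure of similarity, and initial centroids supplied by the caller. The points are hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, move each
    centroid to the mean of its assigned points, and repeat."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        centroids = [
            # An empty cluster keeps its previous centroid.
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical unlabeled points forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, [points[0], points[3]])
print(centroids)
print(clusters)
```

No class labels are used anywhere. The groups emerge only from the distances between points, which is the difference from classification noted above.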


Web data mining


The Web revolution has had a profound impact on
the way we search and find information at home
and at work. From its beginning in the early
1990s, the web has grown to more than ten
billion pages in 2008 (estimates vary), perhaps
even more by the time you are looking at this
slide. Web usage, Web content and Web structure
are discussed in Chapter 5.


Search engines
Normally the search engine databases of Web
pages are built and updated automatically by
Web crawlers. When one searches the Web using
one of the search engines, one is not searching
the entire Web. Instead one is only searching the
database that has been compiled by the search
engine. There are a number of challenging
problems related to search engines that are
discussed in Chapter 6 including how to assign a
ranking to each Web page that is retrieved in
response to a user query.

Data Warehousing and OLAP


Data warehousing is a process by which an
enterprise collects data from the whole enterprise to
build a single version of the truth. This information
is useful for decision makers and may also be used
for data mining. A data warehouse can be of real
help in data mining since data cleaning and other
problems of collecting data would have already
been overcome.
OLAP (Online Analytical Processing) tools are
decision support tools that are often built on top of a
data warehouse or another database. OLAP goes
further than traditional query and report tools in
that a decision maker already has a hypothesis
which he/she is trying to test.


Data Warehousing and OLAP


Data mining is somewhat different from OLAP
since in data mining a hypothesis is not being
tested. Instead data mining is used to uncover
novel patterns in the data.


Before Data Mining


To define a data mining task, one needs to answer
the following:
What data set do I want to mine?
What kind of knowledge do I want to mine?
What background knowledge could be useful?
How do I measure if the results are interesting?
How do I display what I have discovered?


Task-relevant Data
The whole database may not be required since it
may be that we only want to study something
specific e.g. trends in postgraduate students
- countries they come from
- degree program they are doing
- their age
- time they take to finish the degree
- scholarships they have been awarded
May need to build a database subset before data
mining can be done.


Task-relevant Data
Data collection is non-trivial.
OLTP data is not useful since it is changing all the
time. In some cases, data from more than one
database may be needed.


Preprocessing
A data mining process would normally involve
preprocessing
Often data mining applications use data
warehousing
One approach is to pre-mine the data, warehouse
it, then carry out data mining
The process is usually iterative and can take
years of effort for a large project


Data Preprocessing
Preprocessing is very important although often
considered too mundane to be taken seriously
Preprocessing may also be needed after the data
warehouse phase
Data reduction may be needed to transform very
high dimensional data to lower dimensional
data


Data Preprocessing
Feature Selection
Use sampling?
Normalization
Smoothing
Dealing with duplicates, missing data
Dealing with time-dependent data
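
Several of the steps listed above can be sketched in a few lines each. The choices made here (filling missing values with the mean, min-max scaling for normalization, a moving average for smoothing) are common options picked for illustration, not the only ones. The records and helper names are hypothetical.

```python
def drop_duplicates(records):
    """Keep the first occurrence of each record, preserving order."""
    seen, unique = set(), []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique

def fill_missing(values):
    """Replace None with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def moving_average(values, window=3):
    """Smooth a series by averaging each run of `window` neighbours."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical (customer id, monthly spend) records with a duplicate and a gap.
records = [("a", 10.0), ("b", None), ("a", 10.0), ("c", 30.0), ("d", 20.0)]
unique = drop_duplicates(records)
filled = fill_missing([spend for _, spend in unique])
normalized = min_max_normalize(filled)
smoothed = moving_average(filled)
print(unique, filled, normalized, smoothed, sep="\n")
```

In practice the order of these steps matters. For example, duplicates should be removed before computing the mean used to fill gaps, as done here.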


Background knowledge
Background information may be useful in the
discovery process.
For example, concept hierarchies or relationships
between data may be useful in data mining. For
postgraduate degrees, we may wish to look at all
Master's degrees and all doctorate degrees
separately.


Measuring interest
Data mining process may generate many patterns.
We cannot look at all of them and so need some
way to separate uninteresting results from the
interesting ones.
This may be based on simplicity of pattern, rule
length, or level of confidence.
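
Separating interesting patterns from the rest can be as simple as filtering on the measures just mentioned. A minimal sketch, assuming each rule carries a confidence value and that rule length is the total number of items in the rule. The rules and both thresholds are hypothetical.

```python
# Hypothetical rules: (antecedent items, consequent items, confidence).
rules = [
    (("nappies",), ("beer",), 0.67),
    (("milk", "bread", "eggs", "butter"), ("jam",), 0.90),
    (("bread",), ("milk",), 0.40),
]

def interesting(rules, min_confidence=0.6, max_length=3):
    """Keep rules that are confident enough and short enough to be
    easy to understand."""
    return [
        rule for rule in rules
        if rule[2] >= min_confidence and len(rule[0]) + len(rule[1]) <= max_length
    ]

kept = interesting(rules)
print(kept)
```

The second rule is dropped despite its high confidence because it is too long, and the third because its confidence is too low.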


Visualization
We must be able to display results so that they are
easy to understand.
Display may be a graph, pie chart, tables etc.
Some displays are better than others for a given
kind of knowledge.


Guidelines for Successful Data Mining
The data must be available
The data must be relevant, adequate and clean
There must be a well-defined problem
The problem should not be solvable by means of
ordinary query or OLAP tools
The results must be actionable


Guidelines for Successful Data Mining
1. Use a small team with strong internal integration
   and a loose management style.
2. Carry out a small pilot project before a major data
   mining project.
3. Identify a clear problem owner responsible for the
   project. This could be someone in sales or marketing.
   This will benefit the external integration.

Question: Why is each of the above guidelines important for success?


Guidelines for Successful Data Mining
4. Try to realise a positive return on investment
   within 6 to 12 months.
5. The whole data mining project should have the
   support of the top management of the company.

Question: Why is each of the above guidelines important for success?


Data Mining Software


As noted earlier, a large variety of DM software is
now available. Some more widely used software
is:
IBM - Intelligent Miner and more
SAS - Enterprise Miner
Silicon Graphics - MineSet
Oracle - Thinking Machines - Darwin
Angoss - knowledgeSEEKER


Choosing Data Mining Software


Many factors need to be considered if purchasing significant
software:

Product and vendor information

Total cost of ownership

Performance

Functionality and modularity

Training and support

Reporting facilities and visualization

Usability

Question: Which one of the above is the most important? Why?


References
D. Hand, H. Mannila and P. Smyth, Principles of Data Mining, MIT
Press, 2001.
J. Han and M. Kamber, Data Mining: Concepts and Techniques,
Morgan Kaufmann, 2001. The Web site for this book is
http://www.cs.sfu.ca/~han/DM_Book.
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning
Tools and Techniques with Java Implementations, Morgan
Kaufmann, 2000. The Web site for this book is
www.mkp.com/datamining.
Dhar, V. and Stein, R., 1997, Seven methods for transforming
corporate data into business intelligence, Prentice Hall.


References
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy
(eds.), Advances in Knowledge Discovery and Data Mining,
AAAI/MIT Press, 1996
M.S. Chen, J. Han, and P.S. Yu, Data Mining: An Overview from a
Database Perspective, IEEE Transactions on Knowledge and Data
Engineering, 8(6), pp 866-883, 1996.
Berry, M. and Linoff, G., 1997, Data mining techniques for
marketing, sales and support, John Wiley & Sons.
Berry, M. and Linoff, G., 1999, Mastering data mining, John Wiley
& Sons.
