Chapter 04 - in Class
Chapter 4
Data Mining Process, Methods, and
Algorithms
Copyright © 2020, 2015, 2011 Pearson Education, Inc. All Rights Reserved
Learning Objectives
4.1 Define data mining as an enabling technology for
business analytics
4.2 Understand the objectives and benefits of data mining
4.3 Become familiar with the wide range of applications of
data mining
4.4 Learn the standardized data mining processes
4.5 Learn different methods and algorithms of data mining
4.6 Build awareness of the existing data mining software
tools
4.7 Understand the privacy issues, pitfalls, and myths of
data mining
Opening Vignette
Miami-Dade Police Department Is Using
Predictive Analytics to Foresee and Fight Crime
• Predictive analytics in law enforcement
– Policing with less
– New thinking on cold cases
– The big picture starts small (robbery unit)
– Success brings credibility
– Just for the facts
– Safer streets for smarter cities
Why Data Mining?
• Recognition of the value in data sources.
• Availability of quality data on customers, vendors,
transactions, Web, etc.
• Consolidation and integration of data repositories into data
warehouses.
• The exponential increase in data processing and storage
capabilities, and the decrease in their cost.
Definition of Data Mining
• The nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data stored in structured databases.
-- Fayyad et al. (1996)
Data Mining Is a Blend of Multiple
Disciplines
Figure 4.1 Data Mining Is a Blend of Multiple Disciplines.
Data Mining Characteristics &
Objectives
• Source of data for DM is often a consolidated data
warehouse (not always!).
• DM environment is usually a client-server or a Web-based
information systems architecture.
• Data, which may include unstructured data, is the most
critical ingredient for DM.
• The miner is often an end user.
• Striking it rich requires creative thinking.
• Data mining tools’ capabilities and ease of use are
essential (web, parallel processing, etc.)
How Data Mining Works
• DM extracts patterns from data
– Pattern? A mathematical (numeric and/or symbolic)
relationship among data items
• Types of patterns
– Association: commonly co-occurring things
– Prediction: future occurrences of certain events
– Clustering (Segmentation): natural grouping of things
– Sequential relationships: time-ordered events
Figure 4.2 A Simple Taxonomy for Data Mining Tasks, Methods, and Algorithms.
Data Mining versus Statistics
• Are they the same?
– Same: Relationships within data
– Difference
Statistics: well-defined hypothesis with manageable
dataset size
Data Mining: loosely defined discovery statement
for patterns, lots of data; often used as a model for
future events
Data Mining Applications (1 of 4)
• Customer Relationship Management
– Maximize return on marketing campaigns
– Improve customer retention (churn analysis)
– Maximize customer value (cross-, up-selling)
– Identify and treat most valued customers
• Banking & Other Financial
– Automate the loan application process
– Detect fraudulent transactions
– Maximize customer value (cross-, up-selling)
– Optimize cash reserves with forecasting
Data Mining Applications (2 of 4)
• Retailing and Logistics
– Optimize inventory levels at different locations
– Improve the store layout and sales promotions
– Optimize logistics by predicting seasonal effects
– Minimize losses due to limited shelf life
• Manufacturing and Maintenance
– Predict/prevent machinery failures
– Identify anomalies in production systems to optimize
the use of manufacturing capacity
– Discover novel patterns to improve product quality
Data Mining Applications (3 of 4)
• Brokerage and Securities Trading
– Predict changes on certain bond prices
– Forecast the direction of stock fluctuations
– Assess the effect of events on market movements
– Identify and prevent fraudulent activities in trading
• Insurance
– Forecast claim costs for better business planning
– Determine optimal rate plans
– Optimize marketing to specific customers
– Identify and prevent fraudulent claim activities
Data Mining Applications (4 of 4)
• Computer hardware and software
• Science and engineering
• Government and defense
• Homeland security and law enforcement
• Travel, entertainment, sports
• Healthcare and medicine
• Sports,… virtually everywhere…
Data Mining Process
• A systematic way to conduct DM projects
• Moves DM projects from art to science
• Most common standard processes:
– CRISP-DM (Cross-Industry Standard Process for Data
Mining)
– SEMMA (Sample, Explore, Modify, Model, and
Assess)
– KDD (Knowledge Discovery in Databases)
Data Mining Process: CRISP-DM
(1 of 2)
• Cross Industry Standard Process for Data Mining
• Proposed in the 1990s by a European consortium
• Composed of six consecutive steps
– Step 1: Business Understanding
– Step 2: Data Understanding
– Step 3: Data Preparation
(Steps 1 to 3 account for ~85% of total project time)
The above steps involve Descriptive Analytics, or exploratory data analysis (EDA)
Data Mining Process: CRISP-DM
(2 of 2)
Figure 4.3 The Six-Step CRISP-DM Data Mining Process.
• The process is highly repetitive and experimental
Data Mining Process: SEMMA
Data Mining Process: KDD
Figure 4.6 KDD (Knowledge Discovery in Databases)
Process.
Which Data Mining Process is the
Best?
Figure 4.7 Ranking of Data Mining Methodologies/Processes.
Assessment Methods for
Classification
• Predictive accuracy
– Hit rate
• Speed
– Model building versus predicting/usage speed
• Robustness
– Performance on noisy, missing, or erroneous data
• Scalability
– Performance on large amounts of data
• Interpretability
– Insights provided by the model
Accuracy of Classification Models
• In classification problems, the primary source for accuracy
estimation is the confusion matrix
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{True Positive Rate} = \frac{TP}{TP + FN}$$
$$\text{True Negative Rate} = \frac{TN}{TN + FP}$$
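The four measures above follow directly from the confusion-matrix counts. A minimal Python sketch is shown here; the function name and the example counts are hypothetical, chosen only for illustration, and the sketch assumes no denominator is zero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, TPR, and TNR from confusion-matrix counts.

    Assumes every denominator is non-zero (no guard against empty classes).
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
    }


# Hypothetical counts for a two-class problem with 100 test cases.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```

With these counts, 85 of the 100 cases are classified correctly, so accuracy is 0.85, while the true positive rate (0.8) and true negative rate (0.9) show how that accuracy is distributed across the two classes.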
Estimation Methodologies for
Classification: Simple Split
• Simple split (or holdout or test sample estimation)
– Split the data into 2 mutually exclusive sets: training
(~70%) and testing (~30%)
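A simple split can be sketched in a few lines of Python. The function name, the fixed seed, and the toy data are assumptions made for this illustration; the 70/30 ratio comes from the slide.

```python
import random


def simple_split(records, train_fraction=0.7, seed=42):
    """Shuffle the records and cut them into two mutually exclusive sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Toy data: ten record IDs standing in for real rows.
train, test = simple_split(range(10))
print(len(train), len(test))
```

Shuffling before cutting matters when the data is stored in some meaningful order (for example by date or by class); without it, the test set may not resemble the training set.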
Estimation Methodologies for
Classification: k-Fold Cross
Validation
• Data is split into k mutually exclusive subsets, and k training/testing
experiments are conducted
Figure 4.10 A Graphical Depiction of k-Fold Cross-Validation.
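The fold rotation can be sketched as a small Python generator. This is an illustrative sketch, not a library API: the function name is hypothetical, and for brevity it assigns records to folds by position without shuffling first, which a real experiment would normally do.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k experiments."""
    indices = list(range(n_samples))
    # Fold i holds every k-th index starting at i, so folds are mutually exclusive.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Each record is used for testing exactly once and for training k - 1 times; the k accuracy figures are then typically averaged into one estimate.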
Additional Estimation Methodologies
for Validation
• Leave-one-out
– Similar to k-fold with k equal to the number of data points; each data point is used for testing once
• Bootstrapping
– Random sampling with replacement
• Jackknifing
– Similar to leave-one-out; accuracy is calculated by leaving one
sample out at each iteration of the estimation process
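Two of these methodologies can be sketched briefly in Python. The function names and toy data are hypothetical; the bootstrap sketch also returns the records that were never drawn, which are commonly used as the test set, and that pairing is an assumption of this sketch rather than something stated on the slide.

```python
import random


def leave_one_out_splits(records):
    """Yield (training_records, held_out_record), one pair per data point."""
    for i in range(len(records)):
        yield records[:i] + records[i + 1:], records[i]


def bootstrap_sample(records, seed=0):
    """Draw len(records) items at random with replacement.

    Returns the drawn sample and the records that were never drawn.
    """
    rng = random.Random(seed)
    n = len(records)
    chosen = [rng.randrange(n) for _ in range(n)]
    picked = set(chosen)
    in_bag = [records[i] for i in chosen]
    out_of_bag = [records[i] for i in range(n) if i not in picked]
    return in_bag, out_of_bag


loo = list(leave_one_out_splits([1, 2, 3]))
data = list(range(20))
in_bag, out_of_bag = bootstrap_sample(data)
print(loo)
print(sorted(in_bag), out_of_bag)
```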
Area Under the ROC Curve (AUC)
• ROC curve: plots the true positive rate (Y-axis) against the false
positive rate (X-axis)
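As an illustration of how the curve and its area can be computed, the Python sketch below ranks cases by predicted score, traces the (false positive rate, true positive rate) points, and integrates with the trapezoid rule. The function names, labels, and scores are hypothetical, and the sketch assumes both classes are present.

```python
def roc_points(labels, scores):
    """Return ROC points as (false_positive_rate, true_positive_rate) pairs.

    labels: 1 for positive, 0 for negative. scores: higher means more positive.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last case sharing this score (handles ties).
        if i == len(ranked) - 1 or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the piecewise-linear curve, by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical model output for six test cases.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(labels, scores))
print(round(area, 4))
```

Here the model ranks 8 of the 9 positive/negative pairs correctly, so the area is 8/9; an area of 1.0 would mean every positive is scored above every negative, and 0.5 corresponds to random ranking.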
Estimating the Relative Importance
of Predictor Variables
• Sensitivity analysis
– Relative discernibility
– Input value perturbation
– Leave-one-out experiments
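One way to read "input value perturbation" is: nudge one input at a time and measure how much the model's output moves. The Python sketch below follows that reading; the function name, the 10% perturbation size, and the toy linear model are all assumptions made for illustration, not the textbook's procedure.

```python
def perturbation_importance(model, rows, delta=0.1):
    """Relative importance of each input, from the mean output change
    when that input alone is increased by a fraction `delta`.

    Assumes numeric inputs and that at least one input affects the output.
    """
    n_features = len(rows[0])
    baseline = [model(row) for row in rows]
    scores = []
    for j in range(n_features):
        total_change = 0.0
        for row, base in zip(rows, baseline):
            bumped = list(row)
            bumped[j] *= 1 + delta  # perturb only input j
            total_change += abs(model(bumped) - base)
        scores.append(total_change / len(rows))
    total = sum(scores)
    return [s / total for s in scores]  # normalize so the scores sum to 1


# Hypothetical trained model: the third input has no effect on the output.
toy_model = lambda row: 3 * row[0] + 1 * row[1] + 0 * row[2]
importance = perturbation_importance(toy_model, [[1, 1, 1], [2, 2, 2]])
print([round(s, 3) for s in importance])
```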
Decision Trees (1 of 2)
• Employs a divide-and-conquer method
• Recursively divides a training set until each division consists entirely
(or almost entirely) of examples from one class
• A general algorithm (steps) can be:
1. Create a root node and assign all of the training
data to it.
2. Select the best splitting attribute.
3. Add a branch to the root node for each value of
the split. Split the data into mutually exclusive
subsets along the lines of the specific split.
4. Repeat steps 2 and 3 for each leaf
node until the stopping criterion is reached.
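The four steps above can be sketched as a short recursive Python function. This is a minimal illustration under stated assumptions, not any named algorithm such as C4.5 or CART: it handles categorical attributes only, picks the split with the lowest weighted Gini impurity, stops when a node is pure or no attributes remain, and does no pruning. The attribute names and data are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_attribute(rows, labels, attributes):
    """Step 2: choose the attribute whose split has the lowest weighted Gini."""
    def weighted_gini(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())
    return min(attributes, key=weighted_gini)


def build_tree(rows, labels, attributes):
    """Return a class label (leaf) or an (attribute, branches) tuple (node)."""
    # Stopping criteria: the node is pure, or there is nothing left to split on.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]
    branches = {}
    # Step 3: one branch per attribute value, with mutually exclusive subsets.
    for value in set(row[attr] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        branches[value] = build_tree([r for r, _ in subset],
                                     [l for _, l in subset], remaining)
    return (attr, branches)


def predict(tree, row):
    """Follow branches until a leaf label is reached (values must be known)."""
    while isinstance(tree, tuple):
        attr, branches = tree
        tree = branches[row[attr]]
    return tree


# Hypothetical churn data, for illustration only.
rows = [{"plan": "free", "active": "yes"}, {"plan": "free", "active": "no"},
        {"plan": "paid", "active": "yes"}, {"plan": "paid", "active": "no"}]
labels = ["stay", "churn", "stay", "stay"]
tree = build_tree(rows, labels, ["plan", "active"])  # step 1: all data at root
print(tree)
```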
Decision Trees (2 of 2)
• DT algorithms mainly differ on
1. Splitting criteria
Which variable, what value, etc.
Best attribute to split for purifying the class
representation (e.g. Gini index, information gain)
2. Stopping criteria
When to stop building the tree
3. Pruning (generalization method)
Pre-pruning versus post-pruning
• Most popular DT algorithms include
– ID3, C4.5, C5; CART; CHAID; M5
Example in RapidMiner – Hotel App Customer Churn
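Independently of RapidMiner, the information gain criterion named above can be illustrated in plain Python: it measures how much a candidate split reduces the entropy of the class labels. The function names and the example split are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical node with 5 "yes" and 5 "no" cases, split two ways.
parent = ["yes"] * 5 + ["no"] * 5
mostly_pure = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
perfectly_pure = [["yes"] * 5, ["no"] * 5]

gain_a = information_gain(parent, mostly_pure)
gain_b = information_gain(parent, perfectly_pure)
print(round(gain_a, 4), round(gain_b, 4))
```

The split that separates the classes completely earns the maximum possible gain for this node (1 bit), so a tree builder using this criterion would prefer it over the partially pure split.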
Cluster Analysis for Data Mining
• Used for automatic identification of natural groupings of
things
• Part of the machine-learning family
• Employs unsupervised learning
• Learns the clusters of things from past data, then assigns
new instances
• There is NO output/target variable
– In marketing, it is also known as segmentation
Cluster Analysis for Data Mining
• Clustering results may be used to
– Identify natural groupings of customers
– Identify rules for assigning new cases to classes for
targeting/diagnostic purposes
– Provide characterization, definition, labeling of
populations
– Decrease the size and complexity of problems for
other data mining methods
– Identify outliers in a specific domain (e.g., rare-event
detection)
Cluster Analysis for Data Mining
• Analysis methods
– Statistical methods, such as k-means, k-modes…
– Neural networks (self-organizing map)
– Fuzzy logic
– Genetic algorithms
• How many clusters?
– Determine the optimal number of clusters
• General approach
– Divisive (start with one all-inclusive cluster and break it apart)
– Agglomerative (start with each item as its own cluster and
join them)
Cluster Analysis for Data Mining
• k-Means Clustering Algorithm
– k: pre-determined number of clusters
– Algorithm (Step 0: determine value of k)
Step 1: Randomly generate k points as initial
cluster centers.
Step 2: Assign each point to the nearest cluster center.
Step 3: Re-compute the new cluster centers.
Repetition step: Repeat steps 2 and 3 until some
convergence criterion is met (usually when the
assignment of points to clusters becomes stable).
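The steps above can be sketched in plain Python. One deliberate variation, labeled in the code: instead of generating arbitrary random points, this sketch picks k of the data points as the initial centers. The function names, the iteration cap, and the toy points are assumptions for illustration.

```python
import random


def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1 (variation): use k randomly chosen data points as initial centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest cluster center.
        new_assignment = [min(range(k), key=lambda c: dist2(p, centers[c]))
                          for p in points]
        # Convergence: stop when the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: re-compute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return centers, assignment


# Toy data: two visually obvious groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centers, assignment = k_means(points, k=2)
print(centers, assignment)
```

Because the starting centers are random, cluster numbering (which group is 0 and which is 1) depends on the seed; only the grouping itself is meaningful.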
Cluster Analysis for Data Mining -
k-Means Clustering Algorithm
Figure 4.13 A Graphical Illustration of the Steps in the
k-Means Algorithm.
Association Rule Mining
• Input: the simple point-of-sale transaction data
• Output: Most frequent affinities among items
• Example: according to the transaction data…
“Customers who bought a laptop computer and virus
protection software also bought an extended service plan
70 percent of the time.”
• How do you use such a pattern/knowledge?
– Put the items next to each other
– Promote the items as a package
– Place items far apart from each other!
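A statement such as "70 percent of the time" is the confidence of a rule. As a sketch of how such a figure comes out of point-of-sale data, the Python below computes support and confidence for one rule; the function names and the five transactions are hypothetical and are not the data behind the 70 percent example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical point-of-sale baskets.
transactions = [
    {"laptop", "antivirus", "service plan"},
    {"laptop", "antivirus", "service plan"},
    {"laptop", "antivirus"},
    {"laptop", "mouse"},
    {"antivirus"},
]

rule_support = support(transactions, {"laptop", "antivirus", "service plan"})
rule_confidence = confidence(transactions, {"laptop", "antivirus"},
                             {"service plan"})
print(round(rule_support, 2), round(rule_confidence, 2))
```

In this toy data, three baskets contain both a laptop and antivirus software, and two of those also contain a service plan, so the rule holds about 67 percent of the time.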
Association Rule Mining
• Also named “Market-basket Analysis”
• Applications
– Sales transactions
– Credit card transactions
– Banking services
– Insurance service products
– Telecommunication services
– Medical records
Association Rule Mining
• Apriori Algorithm
– Finds subsets that are common to at least a minimum
number of the item sets (i.e. the minimum support)
– Uses a bottom-up approach to extend frequent item
sets one item at a time
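The bottom-up idea can be sketched compactly in Python: find frequent single items, join them into larger candidates, discard any candidate with an infrequent subset, and count the rest. This is a simplified illustration, not an optimized implementation; the function name, the use of an absolute count as minimum support, and the six baskets are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {frozenset(itemset): count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # Level 1: frequent single items.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items()
                if n >= min_support_count}
    result = dict(frequent)
    size = 2
    while frequent:
        prev = list(frequent)
        # Join: combine frequent itemsets into candidates one item larger.
        candidates = {a | b for a in prev for b in prev if len(a | b) == size}
        # Prune: every (size - 1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, size - 1))}
        frequent = {c: n for c, n in count(candidates).items()
                    if n >= min_support_count}
        result.update(frequent)
        size += 1
    return result


# Hypothetical baskets with four products a, b, c, d.
baskets = [{"a", "b", "c", "d"}, {"b", "c", "d"}, {"b", "c"},
           {"a", "b", "d"}, {"a", "b", "c", "d"}, {"b", "d"}]
frequent_itemsets = apriori(baskets, min_support_count=3)
for itemset, n in sorted(frequent_itemsets.items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The prune step is what keeps the search manageable: because {a, c} appears in only two baskets, no larger itemset containing both a and c is ever counted.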
Association Rule Mining Apriori
Algorithm
Figure 4.14 A Graphical Illustration of Frequent Itemsets in
the Apriori Algorithm.
Data Mining Software Tools
Figure 4.15 Popular Data Mining Software Tools (Poll Results).
Application Case 4.6 (2 of 4)
Data Mining Goes to Hollywood: Predicting
Financial Success of Movies
A Typical Classification Problem
Table 4.3 Movie Classification
Class No.   Range (in millions of dollars)
1           < 1 (Flop)
2           > 1 and < 10
3           > 10 and < 20
4           > 20 and < 40
5           > 40 and < 65
6           > 65 and < 100
7           > 100 and < 150
8           > 150 and < 200
9           > 200 (Blockbuster)
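Turning a continuous box-office figure into one of these nine classes is what makes this a classification problem. A minimal Python sketch of that binning follows; the function and constant names are hypothetical, and because the table does not say where a value exactly on a boundary belongs, this sketch assumes it falls into the higher class.

```python
from bisect import bisect_right

# Upper boundaries (in millions of dollars) between classes 1 through 9.
CLASS_BOUNDS = [1, 10, 20, 40, 65, 100, 150, 200]


def movie_class(gross_millions):
    """Map box-office receipts to a class number from 1 (flop) to 9 (blockbuster).

    Assumption: a value exactly on a boundary is placed in the higher class.
    """
    return bisect_right(CLASS_BOUNDS, gross_millions) + 1


for gross in (0.5, 15, 70, 250):
    print(gross, "->", movie_class(gross))
```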
Application Case 4.6 (3 of 4)
Data Mining Goes to Hollywood: Predicting
Financial Success of Movies
Figure 4.16 Process Flow Screenshot for the Box-Office Prediction System.
Data Mining Myths
Table 4.6 Data Mining Myths.
Myth: Data mining provides instant, crystal-ball-like predictions.
Reality: Data mining is a multistep process that requires deliberate, proactive design and use.

Myth: Data mining is not yet viable for mainstream business applications.
Reality: The current state of the art is ready for almost any business type and/or size.

Myth: Data mining requires a separate, dedicated database.
Reality: Because of the advances in database technology, a dedicated database is not required.

Myth: Only those with advanced degrees can do data mining.
Reality: Newer Web-based tools enable managers of all educational levels to do data mining.

Myth: Data mining is only for large firms that have lots of customer data.
Reality: If the data accurately reflect the business or its customers, any company can use data mining.
Data Mining Mistakes
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data mining is and
what it really can/cannot do
3. Beginning without the end in mind
4. Not leaving sufficient time for data acquisition,
selection and preparation
5. Looking only at aggregated results and not at individual
records/predictions
6. … 10 more mistakes… in your book