Chapter 4 SR2023
A Managerial Perspective on
Analytics (3rd Edition)
Chapter 4:
Data Mining
Learning Objectives
Define data mining as an enabling technology for
business intelligence
Understand the objectives and benefits of business
analytics and data mining
Recognize the wide range of applications of data
mining
Learn the standardized data mining processes
CRISP-DM
SEMMA
KDD
(Continued…)
Copyright © 2014 Pearson Education, Inc.
Learning Objectives
Understand the steps involved in data
preprocessing for data mining
Learn different methods and algorithms of data
mining
Build awareness of the existing data mining
software tools
Commercial versus free/open source
Understand the pitfalls and myths of data mining
[Figure: data mining shown at the intersection of several disciplines, including pattern recognition, machine learning, mathematical modeling, and databases]
Types of patterns
Association
Prediction
Cluster (segmentation)
Sequential (or time series) relationships
A Taxonomy for Data Mining Tasks
[Table: columns are Data Mining Task, Learning Method, and Popular Algorithms]
Types of DM
Hypothesis-driven data mining
Discovery-driven data mining
Insurance
Forecast claim costs for better business planning
Determine optimal rate plans
Optimize marketing to specific customers
Identify and prevent fraudulent claim activities
Source: KDNuggets.com
Data Mining Process: CRISP-DM
[Figure: a six-step cycle arranged around the data sources]
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Model Building
5. Testing and Evaluation
6. Deployment
Data Consolidation
· Collect data
· Select data
· Integrate data
Data Transformation
· Normalize data
· Discretize/aggregate data
· Construct new attributes
Result: well-formed data
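To make two of the transformation steps concrete, here is a minimal sketch of normalizing an attribute and constructing a new one. The records, the attribute names (`income`, `debt`, `debt_ratio`), and the choice of min-max scaling are assumptions made for illustration; the slides do not prescribe a particular method.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical records, invented for illustration only.
records = [
    {"income": 30000, "debt": 6000},
    {"income": 60000, "debt": 6000},
    {"income": 90000, "debt": 45000},
]

scaled_incomes = min_max_normalize([r["income"] for r in records])
for record, scaled in zip(records, scaled_incomes):
    record["income_scaled"] = scaled                          # normalized attribute
    record["debt_ratio"] = record["debt"] / record["income"]  # constructed attribute
```

Min-max scaling is only one option; z-score standardization is a common alternative when outliers are present.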
Data Mining Process: SEMMA
· Sample (draw a representative sample of the data)
· Explore (visualization and basic description of the data)
· Modify (select variables, transform variable representations)
· Model (use a variety of statistical and machine learning models)
· Assess (evaluate the accuracy and usefulness of the models)
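Confusion matrix (rows: predicted class; columns: true class)
                      True Positive     True Negative
Predicted Positive    Count (TP)        Count (FP)
Predicted Negative    Count (FN)        Count (TN)

$$\text{True Positive Rate} = \frac{TP}{TP + FN}$$
$$\text{True Negative Rate} = \frac{TN}{TN + FP}$$
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$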
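The measures above can be computed directly from the four confusion-matrix counts. The following is a small sketch; the counts passed in are invented for illustration, and accuracy is included using its standard definition, (TP + TN) divided by all cases.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute common evaluation measures from confusion-matrix counts."""
    return {
        "true_positive_rate": tp / (tp + fn),  # also called sensitivity
        "true_negative_rate": tn / (tn + fp),  # also called specificity
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),              # same formula as the true positive rate
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these counts, precision is 40/50 and the true negative rate is 30/40. A production version would also guard against zero denominators.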
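[Figure: simple split estimation. The preprocessed data is divided into training data (2/3) and testing data (1/3). The training data feeds model development, which produces a classifier; the testing data feeds model assessment (scoring), which yields the prediction accuracy.]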
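A 2/3 versus 1/3 split of this kind can be sketched in a few lines. The function name `holdout_split`, the fixed random seed, and the stand-in data are assumptions for illustration, not part of the slides.

```python
import random


def holdout_split(rows, train_fraction=2 / 3, seed=0):
    """Shuffle the rows and split them into training and testing sets."""
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for repeatable splits
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Thirty stand-in records; real data would be preprocessed rows.
rows = list(range(30))
train, test = holdout_split(rows)
```

The classifier would then be fitted on `train` only, and its accuracy estimated on `test`, which it has not seen.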
[Figure: an ROC-style plot comparing three curves, A, B, and C. The vertical axis is True Positive Rate (Sensitivity); both axes run from 0 to 1.]
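[Table: itemset counts derived from six transactions]

Raw transaction data
Transaction   Items
1             1, 2, 3, 4
1             2, 3, 4
1             2, 3
1             1, 2, 4
1             1, 2, 3, 4
1             2, 4

One-item itemsets
Itemset   Count
1         3
2         6
3         4
4         5

Two-item itemsets
Itemset   Count
1, 2      3
1, 3      2
1, 4      3
2, 3      4
2, 4      5
3, 4      3

Three-item itemsets
Itemset   Count
1, 2, 4   3
2, 3, 4   3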
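The itemset counts above can be reproduced with a level-wise search in the spirit of the Apriori algorithm. This is a minimal sketch, not an optimized implementation; the minimum support of 3 is an assumption, chosen because it is consistent with the three-item itemsets listed.

```python
from itertools import combinations

# The six transactions from the table above, as sets of item numbers.
transactions = [
    {1, 2, 3, 4},
    {2, 3, 4},
    {2, 3},
    {1, 2, 4},
    {1, 2, 3, 4},
    {2, 4},
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for every itemset meeting min_support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    candidates = [(item,) for item in items]
    size = 1
    while candidates:
        # Count each candidate and keep those that are frequent enough.
        level = {}
        for candidate in candidates:
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_support:
                level[candidate] = count
        frequent.update(level)
        # Apriori pruning: a larger itemset is a candidate only if
        # every one of its smaller subsets was itself frequent.
        size += 1
        candidates = [
            c for c in combinations(items, size)
            if all(sub in level for sub in combinations(c, size - 1))
        ]
    return frequent


result = frequent_itemsets(transactions, min_support=3)
```

Here the pair (1, 3) appears in only two transactions, so it is dropped, and no three-item candidate containing both items 1 and 3 is ever counted.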
Biological versus Artificial Neural Networks
[Figure: a biological neuron (dendrites, axon) shown beside an artificial processing element (PE) with inputs x1 ... xn, weights w1 ... wn, a summation, a transfer function, and outputs Y1 ... Yn]
Summation: $$S = \sum_{i=1}^{n} X_i W_i$$
Transfer function: $$Y = f(S)$$
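The summation and transfer function of a single processing element can be sketched as follows. The sigmoid is used as the transfer function f purely as an assumption for illustration (the slides do not name one), and the input and weight values are invented.

```python
import math


def sigmoid(s):
    """A commonly used transfer function that squashes S into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))


def processing_element(inputs, weights, transfer=sigmoid):
    """Compute Y = f(S), where S is the weighted sum of the inputs."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    s = sum(x * w for x, w in zip(inputs, weights))  # S = sum of X_i * W_i
    return transfer(s)


# Hypothetical inputs and weights whose weighted sum is zero.
y = processing_element([1.0, 0.5, -1.0], [0.2, 0.4, 0.4])
```

Passing a different `transfer` argument, such as a step function, changes the behavior of the element without touching the summation.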
Biological      Artificial
Neuron          Node (or PE)
Dendrites       Input
Axon            Output
Synapse         Weight
Slow            Fast
Many (10^9)     Few (10^2)
Elements/Concepts of ANN
Processing element (PE)
Information processing
Network structure
Feedforward vs. recurrent vs. multi-layer…
Learning parameters
Supervised/unsupervised, backpropagation,
learning rate, momentum
ANN Software – NN shells, integrated modules
in comprehensive DM software, …
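Data Mining Software
[Figure: horizontal bar chart of data mining software tools, with a scale from 0 to 120 and two series, "Total (w/ others)" and "Alone". Tools named in the chart include SPSS PASW Modeler (formerly Clementine), Microsoft Excel, MATLAB, IBM SPSS Modeler, other commercial tools, Oracle DM, Viscovery, Weka, Clario Analytics, R, Miner3D, and Thinkanalytics.]
Source: KDNuggets.com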
A Typical Classification

Dependent Variable: range (in $Millions)
· < 1 (Flop)
· > 1 and < 10
· > 10 and < 20
· > 20 and < 40
· > 40 and < 65
· > 65 and < 100
· > 100 and < 150
· > 150 and < 200
· > 200 (Blockbuster)

Independent Variables
Independent Variable   Number of Values   Possible Values
MPAA Rating            5                  G, PG, PG-13, R, NR
Competition            3                  High, Medium, Low
Star value             3                  High, Medium, Low
Genre                  10                 Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related, Thriller, Horror, Comedy, Cartoon, Action, Documentary
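Turning a continuous dollar amount into one of the nine ranges of the dependent variable is a discretization step, which can be sketched with a sorted list of boundaries. The class numbers 1 through 9 and the function name `amount_class` are assumptions for illustration. The ranges are stated with strict inequalities, so the handling of an amount that falls exactly on a boundary (here, it goes to the higher class) is also an assumption.

```python
from bisect import bisect_right

# Upper boundaries of the first eight ranges, in $ millions.
UPPER_BOUNDS = [1, 10, 20, 40, 65, 100, 150, 200]


def amount_class(millions):
    """Map an amount in $ millions to a class from 1 (Flop) to 9 (Blockbuster)."""
    # bisect_right counts how many boundaries are <= the amount.
    return bisect_right(UPPER_BOUNDS, millions) + 1


flop = amount_class(0.5)           # below 1
middle = amount_class(50)          # between 40 and 65
blockbuster = amount_class(250)    # above 200
```

Once the dependent variable is expressed as a class label, the task becomes a classification problem over the independent variables listed above.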
The DM Process Map in IBM SPSS Modeler
[Figure: process map built in IBM SPSS Modeler; the labeled regions include the model assessment process]
Questions, comments