Predictive Analytics I: Data Mining: Process, Methods, and Algorithms
Discussion Questions
1. Why do law enforcement agencies and departments like
Miami-Dade Police Department embrace advanced
analytics and data mining?
2. What are the top challenges for law enforcement
agencies and departments like Miami-Dade Police
Department? Can you think of other challenges (not
mentioned in this case) that can benefit from data
mining?
Opening Vignette (3 of 3)
Prediction
Association
Segmentation
• Time-series forecasting
– Part of the sequence or link analysis?
• Visualization
– Another data mining task?
• Data Mining versus Statistics
– Are they the same?
– What is the relationship between the two?
Data Mining Applications (1 of 4)
• The process is highly repetitive and experimental (DM: art versus science?)
[Figure: six-step data mining process cycle centered on the data: (1) Business Understanding, (2) Data Understanding, (3) Data Preparation, (4) Model Building, (5) Testing and Evaluation, (6) Deployment]
Data Mining Process: SEMMA
• Figure 4.5 SEMMA Data Mining Process
– Explore (visualization and basic description of the data)
– Modify (select variables, transform variable representations)
– Model (use a variety of statistical and machine learning models)
– Assess (evaluate the accuracy and usefulness of the models)
– Feedback
Data Mining Process: KDD
• Figure 4.6 KDD (Knowledge Discovery in Databases) Process
– Input: Sources for Raw Data
– 1. Data Selection, producing Target Data
– 2. Data Cleaning, producing Preprocessed Data
– 3. Data Transformation, producing Transformed Data
– 4. Data Mining, producing Extracted Patterns
– 5. Internalization, producing Knowledge ("Actionable Insight")
– Feedback
Which Data Mining Process is the Best?
• Figure 4.7 Ranking of Data Mining Methodologies/Processes.
[Bar chart: CRISP-DM, My own, SEMMA, KDD Process, My organization's, Domain-specific methodology, None; horizontal axis from 0 to 70]
• Predictive accuracy
– Hit rate
• Speed
– Model building versus predicting/usage speed
• Robustness
• Scalability
• Interpretability
– Transparency, explainability
Accuracy of Classification Models
• In classification problems, the primary source for accuracy
estimation is the confusion matrix
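\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
\[ \text{True Positive Rate} = \frac{TP}{TP + FN} \]
\[ \text{True Negative Rate} = \frac{TN}{TN + FP} \]
\[ \text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN} \]

Confusion matrix (rows: predicted class; columns: true/observed class)

                      True/Observed Positive       True/Observed Negative
Predicted Positive    True Positive Count (TP)     False Positive Count (FP)
Predicted Negative    False Negative Count (FN)    True Negative Count (TN)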
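The confusion-matrix measures above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `classification_metrics` and the example counts are hypothetical, and the sketch assumes every denominator is nonzero.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, TPR, TNR, precision, and recall from raw counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "true_positive_rate": tp / (tp + fn),  # same quantity as recall
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts chosen only for illustration.
metrics = classification_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that recall and the true positive rate share one formula, so the two values always agree.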
Estimation Methodologies for
Classification: Single/Simple Split
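[Figure: simple split. Preprocessed data is divided into training data (2/3) and testing data (1/3). Training data feeds model development, which yields a trained classifier. Testing data feeds model assessment (scoring), which yields prediction accuracy summarized as a confusion matrix (TP, FP, FN, TN).]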
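A single/simple split can be sketched as follows. The data, the toy threshold classifier, and all function names are hypothetical and exist only to show the flow: 2/3 of the records train the model, and the remaining 1/3 are scored into a confusion matrix.

```python
import random

def simple_split(records, train_fraction=2 / 3, seed=42):
    """Shuffle the records and split them into training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def train_threshold(train):
    """Toy classifier: a threshold halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def confusion_counts(test, threshold):
    """Score the held-out records and tally TP, FP, FN, TN."""
    tp = fp = fn = tn = 0
    for x, y in test:
        predicted = 1 if x >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical data: one numeric feature, label 1 when the feature is 15 or more.
records = [(x, 1 if x >= 15 else 0) for x in range(30)]
train, test = simple_split(records)
threshold = train_threshold(train)
tp, fp, fn, tn = confusion_counts(test, threshold)
accuracy = (tp + tn) / len(test)
print(f"train={len(train)} test={len(test)} accuracy={accuracy:.2f}")
```

The accuracy reported this way depends on which records land in the test set, which is one reason resampling schemes such as k-fold are often preferred.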
• Leave-one-out
– Similar to k-fold where k = number of samples
• Bootstrapping
– Random sampling with replacement
• Jackknifing
– Similar to leave-one-out
• Area Under the ROC Curve (AUC)
– ROC: Receiver Operating Characteristics (a term
borrowed from radar image processing)
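Leave-one-out and bootstrapping can be sketched as index generators. This is a minimal illustration with hypothetical function names; using the records never drawn (the "out-of-bag" records) as the test set is a common convention assumed here, not something the slides specify.

```python
import random

def leave_one_out(n):
    """Yield (train_indices, test_index) pairs: k-fold with k = n."""
    for i in range(n):
        yield [j for j in range(n) if j != i], i

def bootstrap_sample(n, seed=0):
    """Draw n indices at random with replacement.

    Returns the drawn indices and the indices that were never drawn.
    """
    rng = random.Random(seed)
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag

folds = list(leave_one_out(5))
in_bag, out_of_bag = bootstrap_sample(10)
print(len(folds), "leave-one-out folds")
print("in-bag:", in_bag, "out-of-bag:", out_of_bag)
```

Because sampling is done with replacement, some indices typically appear more than once in the bootstrap sample while others do not appear at all.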
Area Under the ROC Curve (AUC) (1 of 2)
• Works with binary classification
• Figure 4.11 A Sample ROC Curve
Area Under the ROC Curve (AUC) (2 of 2)
[ROC plot: area under the ROC curve (AUC) A = 0.84]
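For a binary classifier, the AUC can be computed without drawing the curve: it equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The sketch below uses that pairwise formulation; the function name `auc` and the labels and scores are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for positive, 0 for negative; scores: higher means more positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
result = auc(labels, scores)
print(f"AUC = {result:.3f}")  # 8 of 9 pairs ranked correctly -> 0.889
```

The pairwise loop is quadratic in the number of records, so it suits small illustrations rather than large data sets.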
• Analysis methods
– Statistical methods (including both hierarchical and
nonhierarchical), such as k-means, k-modes, and so
on.
– Neural networks (adaptive resonance theory [ART],
self-organizing map [SOM])
– Fuzzy logic (e.g., fuzzy c-means algorithm)
– Genetic algorithms
• How many clusters?
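As an illustration of one of the statistical methods listed above, here is a minimal k-means sketch in plain Python. The function name, the sample points, and the naive initialization (the first k points) are assumptions made for the example; the number of clusters k must be supplied by the caller, which is exactly the "how many clusters?" question.

```python
def k_means(points, k, iterations=100):
    """Cluster points by alternating assignment and centroid update steps."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids
            ]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical two-dimensional points forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.5, 8.5)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

Results depend on the initial centroids, so practical implementations usually try several initializations and compare the outcomes.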
Cluster Analysis for Data Mining (4 of 4)
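Raw transaction data           One-item itemsets    Two-item itemsets    Three-item itemsets
Transaction No.  Items         Itemset  Support     Itemset  Support     Itemset   Support
1001234          1, 2, 3, 4    1        3           1, 2     3           1, 2, 4   3
1001235          2, 3, 4       2        6           1, 3     2           2, 3, 4   3
1001236          2, 3          3        4           1, 4     3
1001237          1, 2, 4       4        5           2, 3     4
1001238          1, 2, 3, 4                         2, 4     5
1001239          2, 4                               3, 4     3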
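The counts in the table above can be reproduced with a short level-wise frequent-itemset search over the six transactions. This is a minimal sketch: the minimum support of 3 and the function names are assumptions made for illustration.

```python
from itertools import combinations

# The six transactions listed above, keyed by transaction number.
transactions = {
    1001234: {1, 2, 3, 4},
    1001235: {2, 3, 4},
    1001236: {2, 3},
    1001237: {1, 2, 4},
    1001238: {1, 2, 3, 4},
    1001239: {2, 4},
}

def support(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for items in transactions.values() if set(itemset) <= items)

def frequent_itemsets(min_support, max_size=3):
    """Grow itemsets one item at a time, keeping only the frequent ones."""
    items = sorted(set().union(*transactions.values()))
    frequent = {}
    current = [(i,) for i in items if support((i,)) >= min_support]
    size = 1
    while current and size <= max_size:
        for itemset in current:
            frequent[itemset] = support(itemset)
        survivors = sorted({i for itemset in current for i in itemset})
        size += 1
        # Keep a candidate only if all of its smaller subsets are frequent.
        current = [
            c for c in combinations(survivors, size)
            if all(sub in frequent for sub in combinations(c, size - 1))
            and support(c) >= min_support
        ]
    return frequent

result = frequent_itemsets(min_support=3)
for itemset, count in result.items():
    print(itemset, count)
```

With this threshold the pair (1, 3), which has support 2, is dropped, so no three-item candidate containing both items is ever counted.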
Data Mining Software Tools
• Commercial
[Bar chart values]
R 1,419
Python 1,325
SQL 1,029
Excel 972
RapidMiner 944
Hadoop 641
Dependent Variable
Class No.   Range (in $ millions)
1           < 1 (Flop)
2           > 1 and < 10
3           > 10 and < 20
4           > 20 and < 40
5           > 40 and < 65
6           > 65 and < 100
7           > 100 and < 150
8           > 150 and < 200
9           > 200 (Blockbuster)
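Turning a revenue figure into one of the nine classes above amounts to a lookup against the class boundaries. The sketch below reads class 1 as revenue below $1 million and class 9 as revenue above $200 million. Because the ranges do not say where a value exactly on a boundary belongs, assigning it to the higher class is an assumption made here; the names `UPPER_BOUNDS` and `revenue_class` are hypothetical.

```python
from bisect import bisect_right

# Upper bounds of classes 1 through 8, in $ millions.
UPPER_BOUNDS = [1, 10, 20, 40, 65, 100, 150, 200]

def revenue_class(revenue_millions: float) -> int:
    """Map box-office revenue (in $ millions) to a class number from 1 to 9.

    Assumption: a value exactly on a boundary goes to the higher class.
    """
    return bisect_right(UPPER_BOUNDS, revenue_millions) + 1

for revenue in (0.5, 15, 50, 250):
    print(revenue, "->", revenue_class(revenue))
```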
Application Case 4.6 (3 of 5)
Independent Variables
Independent Variable   Number of Values   Possible Values
MPAA Rating            5                  G, PG, PG-13, R, NR
Competition            3                  High, Medium, Low
Star value             3                  High, Medium, Low
Genre                  10                 Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related, Thriller, Horror, Comedy, Cartoon, Action, Documentary
Special effects        3                  High, Medium, Low
Sequel                 2                  Yes, No
Number of screens      1                  Positive integer
Application Case 4.6 (4 of 5)
The DM Process Map in IBM SPSS Modeler
[Figure: process map showing the model development process and the model assessment process]
Application Case 4.6 (5 of 5)
• Myth: Data mining provides instant, crystal-ball-like predictions.
– Reality: Data mining is a multistep process that requires deliberate, proactive design and use.
• Myth: Data mining is not yet viable for mainstream business applications.
– Reality: The current state of the art is ready to go for almost any business type and/or size.
• Myth: Data mining requires a separate, dedicated database.
– Reality: Because of the advances in database technology, a dedicated database is not required.
• Myth: Only those with advanced degrees can do data mining.
– Reality: Newer Web-based tools enable managers of all educational levels to do data mining.
• Myth: Data mining is only for large firms that have lots of customer data.
– Reality: If the data accurately reflect the business or its customers, any company can use data mining.
Data Mining Mistakes